Dec 2011 – LA HUG – Santa Monica, CA
Mahout, CDH3, and Recommendation
Josh Patterson | Sr Solution Architect
Who is Josh Patterson?
• josh@cloudera.com
   – Twitter: @jpatanooga
• Master’s Thesis: self-organizing mesh networks
   – Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
• Conceived, built, and led Hadoop integration for
  openPDC project at Tennessee Valley Authority (TVA)
   – Led team which designed classification techniques for time series and
     Map Reduce
• Open source work at
   – http://openpdc.codeplex.com
   – https://github.com/jpatanooga
• Today
   – Sr. Solutions Architect at Cloudera
Outline

    • Intro to Recommendation
    • Recommendation with Mahout and CDH3




“I know I've made some very poor decisions recently, but I
     can give you my complete assurance that my work will be
      back to normal. I've still got the greatest enthusiasm and
            confidence in the mission. And I want to help you. ”
                                              --- HAL from “2001: A Space Odyssey”




                                      Recommendation



Information Explosion

    • Amount of data, articles, shows exploding
      – Hard to know what to pay attention to
      – Be nice if it was personalized to my own
        tastes
    • Issues at scale
      – Heap size limits become an issue with a large
        number of preferences
         • > 1 billion preferences
      – “Real time” recommenders have issues at scale
        as well

                        Copyright 2010 Cloudera Inc. All rights reserved
User-based recommendations

• Look for users who share the same rating
  patterns as the active user
  – looking at the notion of similarity between
    users based on preferences/actions/ratings of
    those users
• So we can recommend the same things to
  similar users
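
As a rough illustration of user-based similarity, the sketch below scores two users by the Euclidean distance between their ratings on the items both have rated, using the sample ratings from the Simple Recommender Input slide. The function names and the 1 / (1 + distance) scaling are assumptions made for this example; they are not Mahout's API.

```python
import math

# Sample ratings (userID -> {itemID: rating}) from the
# "Simple Recommender Input" slide.
ratings = {
    10: {1000: 5.0, 1001: 3.0, 1004: 2.5},
    13: {1001: 3.5, 1002: 4.5, 1003: 1.0, 1004: 3.5},
    15: {1000: 4.5, 1001: 3.5, 1002: 2.5},
}


def user_similarity(a, b):
    """Return a similarity in (0, 1]; 0.0 if the users share no rated items."""
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[i] - b[i]) ** 2 for i in shared))
    return 1.0 / (1.0 + distance)


def most_similar_user(active, ratings):
    """Pick the other user whose ratings are closest to the active user's."""
    others = [u for u in ratings if u != active]
    return max(others, key=lambda u: user_similarity(ratings[active], ratings[u]))


def unseen_items(active, neighbor, ratings):
    """Items the neighbor rated that the active user has not rated yet."""
    return sorted(set(ratings[neighbor]) - set(ratings[active]))


neighbor = most_similar_user(10, ratings)
print(neighbor, unseen_items(10, neighbor, ratings))
```

With this data, user 15 is the closest neighbor of user 10, so item 1002 is the candidate to recommend.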
Item-based recommendations

• Item-based recommenders are derived from how
  similar items are to other items
   – Users who bought X also bought Y
• Compute similarity matrix between items
Item vs User Based

    • Algorithms are similar
      – But not entirely symmetric
    • Item based
      – Running time goes up as the number of items increases
         • If the number of items is relatively low compared to the
           number of users, performance could be better
      – Items tend to change less than users
    • User based
      – Running time goes up as the number of users
        increases


Recommendation in Mahout

    • Not a single recommender engine
      – Assortment of components
    • Components can be plugged together and
      customized
      – We target a specific domain with a custom
        built recommender
      – Need experimentation to get good results




Co-Occurrence Matrix
 • Example:
     – If we have 10 users, and all of them express a preference
       for items A and B
        • A and B are said to co-occur 10 times
 • Can be thought of much like similarity
     – The more we see two items occur together
     – The greater the chance the two items are related
       somehow
 • Producing a Co-Occurrence matrix ends up being a
   simple exercise of counting
     – we compute number of times the pair occurs together
       per user
     – Works well distributed



Simple Recommender Input
 UserID, ItemID, Rating

 10, 1000, 5.0
 10, 1001, 3.0
 10, 1004, 2.5

 13, 1001, 3.5
 13, 1002, 4.5
 13, 1003, 1.0
 13, 1004, 3.5

 15, 1000, 4.5
 15, 1001, 3.5
 15, 1002, 2.5



Simple Co-Occurrence Matrix
         1000   1001   1002   1003   1004

 1000    2      2      1      0      1
 1001    2      3      2      1      2
 1002    1      2      2      1      1
 1003    0      1      1      1      1
 1004    1      2      1      1      2
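
The counting step can be sketched in a few lines of plain Python. This is a single-machine illustration under the assumption that all preferences fit in memory; the function name `cooccurrence_matrix` is made up for this example and is not how Mahout's distributed job is written.

```python
from itertools import product

# (userID, itemID, rating) rows from the "Simple Recommender Input" slide.
prefs = [
    (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5),
    (13, 1001, 3.5), (13, 1002, 4.5), (13, 1003, 1.0), (13, 1004, 3.5),
    (15, 1000, 4.5), (15, 1001, 3.5), (15, 1002, 2.5),
]


def cooccurrence_matrix(prefs):
    """Count, for every item pair, how many users expressed a preference for both."""
    items_by_user = {}
    for user, item, _rating in prefs:
        items_by_user.setdefault(user, set()).add(item)

    items = sorted({item for _user, item, _rating in prefs})
    index = {item: i for i, item in enumerate(items)}
    matrix = [[0] * len(items) for _ in items]

    # Each user contributes 1 to every (a, b) pair among their items,
    # including (a, a), so the diagonal counts users per item.
    for user_items in items_by_user.values():
        for a, b in product(user_items, repeat=2):
            matrix[index[a]][index[b]] += 1
    return items, matrix


items, matrix = cooccurrence_matrix(prefs)
for item, row in zip(items, matrix):
    print(item, row)
```

Because each user's contribution is independent of every other user's, the same counting splits naturally across machines.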




User’s Preferences as a Vector

 • In other recommendation algos we look at
   users as points in space
     – Euclidean distances as similarity
 • In a data model with n items, user
   preferences are like a vector over n
   dimensions
     – With 1 dimension for each item
     – Creates sparse vector
 • Example
     – User 10: { 5.0, 3.0, 0.0, 0.0, 2.5 }
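
A small sketch of building that vector from the sample input, with one position per item and 0.0 where the user has no preference. The helper name `user_vector` is an assumption for illustration only.

```python
# (userID, itemID, rating) rows from the "Simple Recommender Input" slide.
prefs = [
    (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5),
    (13, 1001, 3.5), (13, 1002, 4.5), (13, 1003, 1.0), (13, 1004, 3.5),
    (15, 1000, 4.5), (15, 1001, 3.5), (15, 1002, 2.5),
]

# One dimension per item, in a fixed order.
items = sorted({item for _user, item, _rating in prefs})


def user_vector(user_id, prefs, items):
    """Dense preference vector for one user; unrated items are 0.0."""
    rated = {item: rating for user, item, rating in prefs if user == user_id}
    return [rated.get(item, 0.0) for item in items]


print(user_vector(10, prefs, items))
```

With many items most entries are 0.0, which is why a sparse representation (storing only the rated items) is the practical choice at scale.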

Computing Recommendations

 • Multiply the co-occurrence matrix by the
   user vector (as a column vector)
     – Dot product of each item row vector with the user column vector
 • Result: vector whose dimension is equal to
   the number of items
     – The highest values in the result vector R
       are the “best recommendations”




Calculating R: Example

          1000   1001   1002   1003   1004       User 10       R

 1000     2      2      1      0      1          5.0           18.5
 1001     2      3      2      1      2          3.0           24
 1002     1      2      2      1      1      x   0.0       =   13.5
 1003     0      1      1      1      1          0.0           5.5
 1004     1      2      1      1      2          2.5           16


 R value for item 1002:

 1 ( 5.0 ) + 2 ( 3.0 ) + 2 ( 0.0 ) + 1 ( 0.0 ) + 1 ( 2.5 ) == 13.5
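
The same arithmetic for every row, as a minimal sketch in plain Python. The matrix and the user 10 vector are copied from the slides; the helper name `multiply` is an assumption for this example.

```python
# Co-occurrence matrix for items 1000..1004 (rows and columns in that order).
cooccurrence = [
    [2, 2, 1, 0, 1],
    [2, 3, 2, 1, 2],
    [1, 2, 2, 1, 1],
    [0, 1, 1, 1, 1],
    [1, 2, 1, 1, 2],
]

# Preference vector for user 10.
user_10 = [5.0, 3.0, 0.0, 0.0, 2.5]


def multiply(matrix, vector):
    """Matrix-vector product: one dot product per item row."""
    return [sum(c * v for c, v in zip(row, vector)) for row in matrix]


r = multiply(cooccurrence, user_10)
print(r)  # [18.5, 24.0, 13.5, 5.5, 16.0]
```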



Recommendations

 • If a user has already indicated a
   preference for an item, we don’t
   want to recommend it
 • We take the remaining items
   ranked by their R value
     – Here it would be 1002 at 13.5
        • Followed by 1003 at 5.5

 User 10 preferences        Item    R

 10, 1000, 5.0              1000    18.5
 10, 1001, 3.0              1001    24
 10, 1004, 2.5              1002    13.5
                            1003    5.5
                            1004    16
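
The filter-and-rank step can be sketched as follows, using the R values computed for user 10. The function name `recommend` and its `how_many` parameter are assumptions for illustration.

```python
items = [1000, 1001, 1002, 1003, 1004]
r = [18.5, 24.0, 13.5, 5.5, 16.0]

# Items user 10 has already expressed a preference for.
already_rated = {1000, 1001, 1004}


def recommend(items, scores, already_rated, how_many=10):
    """Drop items the user already rated, then rank the rest by score."""
    candidates = [
        (item, score)
        for item, score in zip(items, scores)
        if item not in already_rated
    ]
    candidates.sort(key=lambda pair: pair[1], reverse=True)
    return candidates[:how_many]


print(recommend(items, r, already_rated))  # [(1002, 13.5), (1003, 5.5)]
```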



“Dave Bowman: I don't know; I think so. You know of course
    though he's right about the 9000 series having a perfect
                                operational record. They do.
      Dr. Frank Poole: Unfortunately that sounds a little like
                                        famous last words. ”
                                                 --- “2001:A Space Odyssey”




      Recommendations with Mahout and CDH3u2




Step 1: Install CDH3u2

 • Set up CDH3u2
     – https://ccp.cloudera.com/display/CDHDOC/CDH3+Quick+Start+Guide
     – Set up in pseudo-distributed mode for this
       demo if you don’t have a cluster




Step 2: Install Mahout

 • Set up Apache Mahout with CDH3
     – https://ccp.cloudera.com/display/CDHDOC/Mahout+Installation
     – Make sure $JAVA_HOME is set or Mahout will
       complain




Step 3: Get Grouplens Data
 • Download
     – http://www.grouplens.org/system/files/ml-1m.zip
 • Format
     – UserID::MovieID::Rating::Timestamp
 • where
     –   UserIDs are integers
     –   MovieIDs are integers
     –   Ratings are 1 through 5 “stars” (integers)
     –   Timestamp is seconds since the epoch
 • Each user has at least 20 ratings


Step 4: Prep Data

 • This file isn’t exactly in the format Mahout
   prefers, but this is an easy fix
     – Mahout is looking for a CSV file with lines of
       the form:
        • userID, itemID, value
 • From bash run
     – tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv
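
For readers without a bash shell, the same conversion can be sketched in Python. The function names `to_csv_line` and `convert` are assumptions for this example; the file names match the ones used on this slide.

```python
def to_csv_line(line):
    """Turn 'UserID::MovieID::Rating::Timestamp' into 'UserID,MovieID,Rating'."""
    user_id, movie_id, rating, _timestamp = line.strip().split("::")
    return ",".join((user_id, movie_id, rating))


def convert(src_path, dst_path):
    """Rewrite a '::'-delimited ratings file as the CSV Mahout expects."""
    with open(src_path, encoding="utf-8") as src, \
            open(dst_path, "w", encoding="utf-8") as dst:
        for line in src:
            if line.strip():  # skip blank lines
                dst.write(to_csv_line(line) + "\n")


if __name__ == "__main__":
    import os

    # Only convert when the downloaded file is actually present.
    if os.path.exists("ratings.dat"):
        convert("ratings.dat", "ratings.csv")
```

Like the `cut -f1-3` step, this drops the timestamp column and keeps only userID, itemID, and value.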



Step 5: Generate Recommendations

 • Input to this job is going to be the
   “ratings.csv” file we generated of the format:
     – userID, itemID, value
 • We also want to give it a list of userIDs to
   generate recommendations for
 • Output of the recommendation job will be
   a text file with lines of the form:
     – userID [itemID:score, ...]
     – Represents the userIDs with their recommended
       itemIDs along with the preference scores


Step 5: Command Line

 • Put ratings file in HDFS
     – hadoop fs -put ratings.csv [input-hdfs-path]
 • Put user file in HDFS
     – Let’s put “6040” on a single line in a file and put
       that in HDFS
        • hadoop fs -put [my_local_file] [user_file_location_in_hdfs]
 • Now we can run the recommender job
     – mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]

Take a Look at the Results

 • Cat output of job
     – hadoop fs -cat [output-hdfs-path]/part-r-00000
 • Which should look like:
     –   6040   [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]
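
To use these results in another program, each output line can be parsed into a user ID and a list of (itemID, score) pairs. This is a minimal sketch that assumes the layout shown above, with whitespace between the user ID and the bracketed list; the function name `parse_recommendations` is made up for this example.

```python
def parse_recommendations(line):
    """Parse 'userID [item:score,item:score,...]' into (userID, [(item, score), ...])."""
    user_field, items_field = line.split(None, 1)
    pairs = items_field.strip().strip("[]").split(",")
    recommendations = []
    for pair in pairs:
        if not pair:  # an empty list "[]" yields one empty string
            continue
        item_id, score = pair.split(":")
        recommendations.append((int(item_id), float(score)))
    return int(user_field), recommendations


user, recs = parse_recommendations("6040\t[1941:5.0,1904:5.0,2859:5.0]")
print(user, recs)
```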




Questions? (Thank You!)

 • Recommendation Tutorial based on:
     – http://www.cloudera.com/blog/2011/11/recommendation-with-apache-mahout-in-cdh3/
 • Cloudera’s Distribution including Apache
   Hadoop (CDH):
     – http://www.cloudera.com
 • Apache Mahout
     – http://mahout.apache.org



More?
 • Look at www.cloudera.com/training to learn more about
   Hadoop
 • Read www.cloudera.com/blog
     • Lots of great use cases.
 • Check out the downloads page at
     • www.cloudera.com/downloads
     • Get your own copy of Cloudera Distribution for Apache Hadoop
       (CDH)
     • Grab Demo VMs, Connectors, other useful tools.

 • Contact Josh with any questions at
     • josh@cloudera.com



References

 • S. Owen, R. Anil, T. Dunning, E. Friedman:
   Mahout in Action
 • Sarwar et al.: Item-Based Collaborative
   Filtering Recommendation Algorithms
 • Apache Mahout Wiki:
     – http://mahout.apache.org/




Workflow
 •   Job 1
      –   Preprocess data if needed
 •   Job 2
      –   Create User Vectors
 •   Job 3
      –   Count Users
 •   Job 4
      –   Prune and Transpose
 •   Job 5
      –   RowSimilarityJob
             •   Weights
             •   pairwiseSimilarity
             •   asMatrix
 •   Job 6
      –   Pre Partial Multiply 1
 •   Job 7
      –   Pre Partial Multiply 2
 •   Job 8
      –   Partial Multiply
 •   Job 9




Temp Files Generated
 •   countUsers
 •   itemIDIndex
 •   itemUserMatrix
 •   pairwiseSimilarity
 •   partialMultiply
 •   partialMultiply1
 •   partialMultiply2
 •   similarityMatrix
 •   userVectors
 •   weights

Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 

Dernier (20)

What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Time Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directionsTime Series Foundation Models - current state and future directions
Time Series Foundation Models - current state and future directions
 
TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 

LA HUG Dec 2011 - Recommendation Talk

  • 1. Dec 2011 – LA HUG – Santa Monica, CA
       Mahout, CDH3, and Recommendation
       Josh Patterson | Sr Solution Architect
  • 2. Who is Josh Patterson?
      • josh@cloudera.com
          – Twitter: @jpatanooga
      • Master’s Thesis: self-organizing mesh networks
          – Published in IAAI-09: TinyTermite: A Secure Routing Algorithm
      • Conceived, built, and led Hadoop integration for openPDC project at Tennessee Valley Authority (TVA)
          – Led team which designed classification techniques for time series and Map Reduce
      • Open source work at
          – http://openpdc.codeplex.com
          – https://github.com/jpatanooga
      • Today
          – Sr. Solutions Architect at Cloudera
  • 3. Outline
      • Intro to Recommendation
      • Recommendation with Mahout and CDH3
  • 4. “I know I've made some very poor decisions recently, but I can give you my complete assurance that my work will be back to normal. I've still got the greatest enthusiasm and confidence in the mission. And I want to help you.” --- HAL from “2001: A Space Odyssey”
       Recommendation
  • 5. Information Explosion
      • Amount of data, articles, shows exploding
          – Hard to know what to pay attention to
          – Be nice if it was personalized to my own tastes
      • Issues at scale
          – Heap size limits become issue with large number of preferences
              • > 1 Billion preferences
          – “real time” recommenders have issues as well with scale
       Copyright 2010 Cloudera Inc. All rights reserved
  • 6. User-based recommendations
      • Look for users who share the same ratings patterns with the active user
          – looking at the notion of similarity between users based on preferences/actions/ratings of those users
      • So we can recommend the same things to similar users
  • 7. Item-based recommendations
      • Item based recommenders are derived from how similar items are to items
          – Users who bought X also bought Y
      • Compute similarity matrix between items
  • 8. Item vs User Based
      • Algorithms are similar
          – But not entirely symmetric
      • Item based
          – Running time goes up as the number of items increases
              • If the number of items is relatively low compared to the number of users, performance could be better
          – Items tend to change less than users
      • User based
          – Running time goes up as the number of users increases
  • 9. Recommendation in Mahout
      • Not a single recommender engine
          – Assortment of components
      • Components can be plugged together and customized
          – We target a specific domain with a custom built recommender
          – Need experimentation to get good results
  • 10. Co-Occurrence Matrix
      • Example:
          – If we have 10 users, and all of them express a preference for items A and B
              • A and B are said to co-occur 10 times
      • Can be thought of much like similarity
          – The more we see two items occur together
          – The greater the chance the two items are related somehow
      • Producing a Co-Occurrence matrix ends up being a simple exercise of counting
          – we compute number of times the pair occurs together per user
          – Works well distributed
  • 11. Simple Recommender Input
       UserID, ItemID, Rating
       10, 1000, 5.0
       10, 1001, 3.0
       10, 1004, 2.5
       13, 1001, 3.5
       13, 1002, 4.5
       13, 1003, 1.0
       13, 1004, 3.5
       15, 1000, 4.5
       15, 1001, 3.5
       15, 1002, 2.5
  • 12. Simple Co-Occurrence Matrix
             1000  1001  1002  1003  1004
       1000     2     2     1     0     1
       1001     2     3     2     1     2
       1002     1     2     2     1     1
       1003     0     1     1     1     1
       1004     1     2     1     1     2
  • 13. User’s Preferences as a Vector
      • In other recommendation algos we look at users as points in space
          – Euclidean distances as similarity
      • In a data model with n items, user preferences are like a vector over n dimensions
          – With 1 dimension for each item
          – Creates sparse vector
      • Example
          – User 10: { 5.0, 3.0, 0.0, 0.0, 2.5 }
  • 14. Computing Recommendations
      • Multiply the co-occurrence matrix by the user vector (as column vector)
          – Each item row vector x user column vector
      • Result: vector whose dimension is equal to the number of items
          – The highest values in the result vector R are the “best recommendations”
  • 15. Calculating R: Example
             1000  1001  1002  1003  1004       User 10        R
       1000     2     2     1     0     1         5.0        18.5
       1001     2     3     2     1     2         3.0        24
       1002     1     2     2     1     1    x    0.0    =   13.5
       1003     0     1     1     1     1         0.0         5.5
       1004     1     2     1     1     2         2.5        16
       R value for item 1002: 1 ( 5.0 ) + 2 ( 3.0 ) + 2 ( 0.0 ) + 1 ( 0.0 ) + 1 ( 2.5 ) == 13.5
  • 16. Recommendations
      • If a user has already indicated a preference for an item, we don’t want to recommend it
      • We take the remaining items ranked by their R value
          – Here it would be 1002 at 13.5
              • Followed by 1003 at 5.5
       User 10 input: (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5)
       R: 18.5, 24, 13.5, 5.5, 16
  • 17. “Dave Bowman: I don't know; I think so. You know of course though he's right about the 9000 series having a perfect operational record. They do.
       Dr. Frank Poole: Unfortunately that sounds a little like famous last words.” --- “2001: A Space Odyssey”
       Recommendations with Mahout and CDH3u2
  • 18. Step 1: Install CDH3u2
      • Set up CDH3u2
          – https://ccp.cloudera.com/display/CDHDOC/CDH3+Quick+Start+Guide
          – Set up in pseudo-distributed mode for this demo if you don’t have a cluster
  • 19. Step 2: Install Mahout
      • Set up Apache Mahout with CDH3
          – https://ccp.cloudera.com/display/CDHDOC/Mahout+Installation
          – Make sure $JAVA_HOME is set or Mahout will complain
  • 20. Step 3: Get Grouplens Data
      • Download
          – http://www.grouplens.org/system/files/ml-1m.zip
      • Format
          – UserID::MovieID::Rating::Timestamp
      • where
          – UserIDs are integers
          – MovieIDs are integers
          – Ratings are 1 through 5 “stars” (integers)
          – Time stamp is seconds since the epoch
      • Each user has at least 20 ratings
  • 21. Step 4: Prep Data
      • This file isn’t in exactly the format Mahout prefers, but this is an easy fix
          – Mahout is looking for a CSV file with lines of the form:
              • userID, itemID, value
      • From bash run
          – tr -s ':' ',' < ratings.dat | cut -f1-3 -d, > ratings.csv
  • 22. Step 5: Generate Recommendations
      • Input to this job is going to be the “ratings.csv” file we generated, in the format:
          – userID, itemID, value
      • We also want to give it a list of userIDs to generate recommendations for
      • Output of the recommendation job will be another file with the layout:
          – userID [ itemID:score, ... ]
          – Represents the userIDs with their recommended itemIDs along with the preference scores
  • 23. Step 5: Command Line
      • Put ratings file in HDFS
          – hadoop fs -put ratings.csv [input-hdfs-path]
      • Put user file in HDFS
          – Let’s put “6040” on a single line in a file and put that in HDFS
              • hadoop fs -put [my_local_file] [user_file_location_in_hdfs]
      • Now we can run the recommender job
          – mahout recommenditembased --input [input-hdfs-path] --output [output-hdfs-path] --tempDir [tmp-hdfs-path] --usersFile [user_file_location_in_hdfs]
  • 24. Take a Look at the Results
      • Cat output of job
          – hadoop fs -cat [output-hdfs-path]/part-r-00000
      • Which should look like:
          – 6040 [1941:5.0,1904:5.0,2859:5.0,3811:5.0,3814:5.0,14:5.0,17:5.0,3795:5.0,3794:5.0,3793:5.0]
  • 25. Questions? (Thank You!)
      • Recommendation Tutorial based on:
          – http://www.cloudera.com/blog/2011/11/recommendation-with-apache-mahout-in-cdh3/
      • Cloudera’s Distribution including Apache Hadoop (CDH):
          – http://www.cloudera.com
      • Apache Mahout
          – http://mahout.apache.org
  • 26. More?
      • Look at www.cloudera.com/training to learn more about Hadoop
      • Read www.cloudera.com/blog
          – Lots of great use cases.
      • Check out the downloads page at www.cloudera.com/downloads
          – Get your own copy of Cloudera Distribution for Apache Hadoop (CDH)
          – Grab Demo VMs, Connectors, other useful tools.
      • Contact Josh with any questions at josh@cloudera.com
  • 27. References
      • S. Owen, R. Anil, T. Dunning, E. Friedman: Mahout in Action
      • Sarwar et al.: Item-Based Collaborative Filtering Recommendation Algorithms
      • Apache Mahout Wiki:
          – http://mahout.apache.org/
  • 28. Workflow
      • Job 1 – Preprocess data if needed
      • Job 2 – Create User Vectors
      • Job 3 – Count Users
      • Job 4 – Prune and Transpose
      • Job 5 – RowSimilarityJob
          • Weights
          • pairwiseSimilarity
          • asMatrix
      • Job 6 – Pre Partial Multiply 1
      • Job 7 – Pre Partial Multiply 2
      • Job 8 – Partial Multiply
      • Job 9
  • 29. Temp Files Generated
      • countUsers
      • itemIDIndex
      • itemUserMatrix
      • pairwiseSimilarity
      • partialMultiply
      • partialMultiply1
      • partialMultiply2
      • similarityMatrix
      • userVectors
      • weights
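
The data-prep and results steps (slides 21 and 24) can also be sketched in Python, which may be handy where `tr` and `cut` are not available. This is only an illustration, not part of the talk: the function names `movielens_to_csv` and `parse_recommendation_line` are made up here, and the sample rating line is hypothetical. The sketch assumes the `UserID::MovieID::Rating::Timestamp` input format and the `userID [itemID:score,...]` output layout shown on the slides.

```python
def movielens_to_csv(lines):
    """Turn 'UserID::MovieID::Rating::Timestamp' lines into 'userID,itemID,value'."""
    out = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # Keep the first three fields and drop the timestamp.
        user_id, item_id, rating = line.split("::")[:3]
        out.append(f"{user_id},{item_id},{rating}")
    return out


def parse_recommendation_line(line):
    """Parse 'userID [itemID:score,...]' into (userID, [(itemID, score), ...])."""
    user_part, items_part = line.strip().split(None, 1)
    pairs = []
    for entry in items_part.strip("[]").split(","):
        if not entry:
            continue  # an empty list "[]" yields one empty entry
        item_id, score = entry.split(":")
        pairs.append((int(item_id), float(score)))
    return int(user_part), pairs


# Hypothetical sample line in the ratings.dat format.
csv_lines = movielens_to_csv(["10::1000::5::978300760"])

# First two entries of the result line shown on slide 24.
user, recs = parse_recommendation_line("6040\t[1941:5.0,1904:5.0]")
```

Writing `csv_lines` to `ratings.csv` would give the same three-column file that the shell one-liner on slide 21 produces.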

Editor's notes

  1. It's all about the love, baby.
  2. Theme: they threw away a lot of valuable gas and oil, just like we throw away data today
  3. Do we want to do a quick slide on types of attributes?
       Nominal: “sunny”, “overcast”, and “rainy”
       Ordinal: like nominal, but with order
       Interval: “year”, “temp”, expressed in fixed and equal units
       Ratio: scheme defines a zero point, example: “distance”, treated as real numbers
  4. If an item co-occurs with another item a user prefers, it's probably going to be an item the user will be interested in
  5. The dot-product formula sums the products of co-occurrences and preference values. When an item’s co-occurrences overlap more often with highly preferred items, the sum ends up being larger
  6. Theme: they threw away a lot of valuable gas and oil, just like we throw away data today
  7. RowSimilarityJob -> SimilarityType -> DistributedCooccurrenceVectorSimilarity
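
The counting and dot-product ideas in notes 4 and 5 (slides 10 to 16) can be reproduced in a few lines of single-machine Python. This is a minimal sketch of the arithmetic only, not Mahout's distributed implementation. The ratings are the ten rows from the "Simple Recommender Input" slide, and the helper names `cooccurrence_matrix` and `recommend` are invented for this example.

```python
from itertools import product

# (userID, itemID, rating) rows from the "Simple Recommender Input" slide.
ratings = [
    (10, 1000, 5.0), (10, 1001, 3.0), (10, 1004, 2.5),
    (13, 1001, 3.5), (13, 1002, 4.5), (13, 1003, 1.0), (13, 1004, 3.5),
    (15, 1000, 4.5), (15, 1001, 3.5), (15, 1002, 2.5),
]


def cooccurrence_matrix(ratings):
    """Count, for every item pair, how many users expressed a preference for both."""
    items = sorted({item for _, item, _ in ratings})
    by_user = {}
    for user, item, _ in ratings:
        by_user.setdefault(user, set()).add(item)
    counts = {a: {b: 0 for b in items} for a in items}
    for user_items in by_user.values():
        # Includes (a, a), so the diagonal holds the number of users per item.
        for a, b in product(user_items, repeat=2):
            counts[a][b] += 1
    return items, counts


def recommend(user, ratings, items, counts):
    """Multiply each matrix row by the user's preference vector, then rank unseen items."""
    prefs = {item: value for u, item, value in ratings if u == user}
    scores = {
        row: sum(counts[row][col] * prefs.get(col, 0.0) for col in items)
        for row in items
    }
    ranked = sorted(
        ((item, score) for item, score in scores.items() if item not in prefs),
        key=lambda pair: pair[1],
        reverse=True,
    )
    return scores, ranked


items, counts = cooccurrence_matrix(ratings)
scores, ranked = recommend(10, ratings, items, counts)
print(scores)  # {1000: 18.5, 1001: 24.0, 1002: 13.5, 1003: 5.5, 1004: 16.0}
print(ranked)  # [(1002, 13.5), (1003, 5.5)]
```

The printed values match the R column on slide 15 and the ranking on slide 16: items 1000, 1001, and 1004 are dropped because user 10 already rated them, which leaves 1002 ahead of 1003.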