Scientific Article Recommendation with Mahout

Kris Jack, PhD
Senior Data Mining Engineer
Use Case
➔ Good researchers are on top of their game
➔ Large amount of research produced
➔ Takes time to get at what you need
➔ Help researchers by recommending relevant research
1.5 million+ users; the 20 largest user bases:
    University of Cambridge
    Stanford University
    MIT
    University of Michigan
    Harvard University
    University of Oxford
    Sao Paulo University
    Imperial College London
    University of Edinburgh
    Cornell University
    University of California at Berkeley
    RWTH Aachen
    Columbia University
    Georgia Tech
    University of Wisconsin
    UC San Diego
    University of California at LA
    University of Florida
    University of North Carolina

50M research articles
We need a recommender that scales up, coping with our data and future growth
Questions
➔ How does Mahout's recommender work?
➔ How well does it perform out of the box?
➔ How well does it perform after some tuning?
Mahout's Recommender

Generating recommendations through matrix multiplication

These are item-based recommendations, as similarity is computed between items, not users.

org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
[Figure: Input (all user preferences). A matrix of Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (Turing, Babbage, Einstein, Newton).]
[Figure: Input (all user preferences) at full scale: 1.5M researchers, 50M research articles, 300M prefs.]
item.RecommenderJob
 1. Prepare preference matrix (1-3)
 2. Generate similarity matrix (4-6)
 3. Multiply matrices (7-10)

[Figure: All User Preferences (item x user), a matrix of Research Articles by Researchers.]
[Figure: A User's Preferences (item x user): Turing's column taken from All User Preferences (item x user).]
[Figure: Item Similarity (item x item) shown next to A User's Preferences (item x user) for Turing.]

Item Similarity (item x item), Research Articles by Research Articles:
    2   1   0   0
    1   1   0   0
    0   0   2   2
    0   0   2   2
[Figure: Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), shown for Turing.]
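The multiplication above can be sketched in plain Java. This is a toy, in-memory illustration and not Mahout's code: the class ItemBasedSketch, its method names, and the preference matrix are hypothetical. The preferences were chosen (as an assumption) so that simple co-occurrence counts reproduce the item similarity values shown on the slide. The real RecommenderJob runs as a series of distributed MapReduce jobs, which is not modelled here.

```java
import java.util.Arrays;

/** Toy illustration of item-based scoring via matrix multiplication (not Mahout's implementation). */
final class ItemBasedSketch {

    // Hypothetical preferences (item x user).
    // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    static final int[][] PREFS = {
        {1, 1, 0, 0},
        {1, 0, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };

    /** Item similarity as co-occurrence counts: S = P * P^T (item x item). */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++) {
            for (int j = 0; j < items; j++) {
                for (int u = 0; u < prefs[i].length; u++) {
                    sim[i][j] += prefs[i][u] * prefs[j][u];
                }
            }
        }
        return sim;
    }

    /** Scores for one user: the similarity matrix times that user's preference column. */
    static int[] scoresFor(int[][] sim, int[][] prefs, int user) {
        int[] scores = new int[sim.length];
        for (int i = 0; i < sim.length; i++) {
            for (int j = 0; j < sim.length; j++) {
                scores[i] += sim[i][j] * prefs[j][user];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] sim = itemSimilarity(PREFS);
        int babbage = 1; // column index of the user to score
        System.out.println(Arrays.deepToString(sim));
        System.out.println(Arrays.toString(scoresFor(sim, PREFS, babbage)));
    }
}
```

In this toy data Babbage has only Comp Sci 1, so the scores come out as [2, 1, 0, 0]; once articles already in the user's library are dropped, Comp Sci 2 is the only candidate left.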
How well does it work?

Mendeley Suggest
Running on Amazon's Elastic MapReduce

On-demand use and easy to cost
[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours (0 to 7K); x-axis: No. Good Recommendations/10 (0 to 3). Quadrants: Costly & Bad (top left), Costly & Good (top right), Cheap & Bad (bottom left), Cheap & Good (bottom right). Plotted so far: Orig. item-based at 6.5K hours, 1.5 good recommendations/10.]
Let's tune it!
1. Reduce processing time

2. Improve quality
1. Reduce processing time
➔ Mahout's recommender is already efficient
➔ But your data may have unusual properties
➔ Hadoop may need a helping hand
➔ Let's see what's going on...
Task Allocation: 37 hours to complete

1 reducer allocated, despite having 48 available...
Task Allocation

Allocating more reducers on a per-job basis

    // Request numReducers reduce tasks for this job
    job.getConfiguration().setInt(
        "mapred.reduce.tasks",
        numReducers);

Allocating more mappers on a per-job basis

    // Cap the input split size (in bytes); smaller splits mean more map tasks
    job.getConfiguration().set(
        "mapred.max.split.size",
        String.valueOf(splitSize));
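As a rough illustration of why a smaller maximum split size yields more mappers, the sketch below estimates the number of input splits for a single file. It assumes the split size is chosen as max(minSize, min(maxSize, blockSize)), which is how Hadoop's new-API FileInputFormat is understood to pick it, and it ignores Hadoop's tolerance for a slightly oversized final split. The class name, method names, and all sizes are hypothetical.

```java
/** Back-of-the-envelope estimate of map task count from split size (a sketch, not Hadoop code). */
final class SplitSizeSketch {

    /** Assumed split size rule: the block size, clamped between minSize and maxSize. */
    static long splitSize(long blockSize, long minSize, long maxSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Number of splits for one file: ceiling of fileSize / splitSize. */
    static long estimatedSplits(long fileSize, long splitSize) {
        return fileSize / splitSize + (fileSize % splitSize == 0 ? 0 : 1);
    }

    public static void main(String[] args) {
        long mib = 1024L * 1024L;
        long fileSize = 1024 * mib;   // hypothetical 1 GiB input file
        long blockSize = 128 * mib;   // hypothetical block size

        long uncapped = splitSize(blockSize, 1L, Long.MAX_VALUE);
        long capped = splitSize(blockSize, 1L, 32 * mib); // lower max split size

        System.out.println("splits without cap: " + estimatedSplits(fileSize, uncapped));
        System.out.println("splits with 32 MiB cap: " + estimatedSplits(fileSize, capped));
    }
}
```

With these made-up sizes, capping the split size at a quarter of the block size quadruples the estimated number of splits, and so the number of map tasks.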
Task Allocation: 37 hours to complete → 14 hours

From 1 → 40 reducers
Partitioners: 14 hours to complete

[Figure: partition sizes vary widely, from ~50KB to ~500MB.]
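    // Sample the input keys, write the partition boundaries to a file,
    // then partition by key range using those boundaries
    InputSampler.Sampler<IntWritable, Text> sampler =
        new InputSampler.RandomSampler<IntWritable, Text>(...);
    InputSampler.writePartitionFile(conf, sampler);
    conf.setPartitionerClass(TotalOrderPartitioner.class);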




http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
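The idea behind sampling plus total-order partitioning can be sketched without Hadoop. The class below is a hypothetical, in-memory illustration: it sorts a sample of keys, takes evenly spaced split points from it, and assigns each key to a partition by binary search over those split points. It assumes integer keys and a sample with distinct values; the names TotalOrderSketch, splitPoints, and partitionFor are made up for this example and are not Hadoop APIs.

```java
import java.util.Arrays;

/** Toy range partitioning from a key sample (illustrative only, not Hadoop's classes). */
final class TotalOrderSketch {

    /** Picks numPartitions - 1 split points at evenly spaced positions in the sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        float step = sorted.length / (float) numPartitions;
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[Math.round(step * i)];
        }
        return points;
    }

    /** Keys below the first split point go to partition 0; a key equal to a split point goes right. */
    static int partitionFor(int key, int[] splitPoints) {
        int pos = Arrays.binarySearch(splitPoints, key);
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] sample = {5, 1, 9, 3, 7, 2, 8, 4, 6, 10}; // hypothetical sampled keys
        int[] points = splitPoints(sample, 4);
        System.out.println("split points: " + Arrays.toString(points));
        for (int key : new int[] {1, 4, 5, 6, 10}) {
            System.out.println("key " + key + " -> partition " + partitionFor(key, points));
        }
    }
}
```

Because the split points follow the sampled key distribution, dense key ranges get narrower partitions, which is what evens out the work per reducer.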
Partitioners: 14 hours to complete → 2 hours

Evenly distributed
[Figure: Mahout's Performance, after reducing processing time. Orig. item-based: 6.5K normalised Amazon hours, 1.5 good recommendations/10. Cust. item-based: 2.4K hours, 1.5 good recommendations/10, a saving of 4.1K hours (63%).]
2. Improve quality
➔ Mahout provides item-based collaborative filtering (CF)
➔ We have many more items than users
➔ In that case, user-based CF is typically more appropriate
    ➔ So let's make one!
[Figure: the same pipeline relabelled for a user-based recommender: "item" in item.RecommenderJob is overwritten with "user", Research Articles with Researchers, and Item Similarity (item x item) with User Similarity (user x user).]
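A user-based variant can be sketched in plain Java as follows. This is a hypothetical, in-memory illustration and not the custom recommender described in the slides: the class UserBasedSketch, its methods, and the preference matrix are assumptions, and user similarity is taken to be a simple count of shared articles. With 1.5M users against 50M articles, the user x user similarity matrix has far fewer rows and columns than an item x item one.

```java
import java.util.Arrays;

/** Toy illustration of user-based scoring via matrix multiplication (not Mahout's implementation). */
final class UserBasedSketch {

    // Hypothetical preferences (item x user).
    // Rows: Comp Sci 1, Comp Sci 2, Physics 1, Physics 2.
    // Columns: Turing, Babbage, Einstein, Newton.
    static final int[][] PREFS = {
        {1, 1, 0, 0},
        {1, 0, 0, 0},
        {0, 0, 1, 1},
        {0, 0, 1, 1},
    };

    /** User similarity as counts of shared articles: U = P^T * P (user x user). */
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int a = 0; a < users; a++) {
            for (int b = 0; b < users; b++) {
                for (int[] itemRow : prefs) {
                    sim[a][b] += itemRow[a] * itemRow[b];
                }
            }
        }
        return sim;
    }

    /** Scores for one user: preferences (item x user) times that user's similarity column. */
    static int[] scoresFor(int[][] prefs, int[][] userSim, int user) {
        int[] scores = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++) {
            for (int v = 0; v < userSim.length; v++) {
                scores[i] += prefs[i][v] * userSim[v][user];
            }
        }
        return scores;
    }

    public static void main(String[] args) {
        int[][] userSim = userSimilarity(PREFS);
        int babbage = 1; // column index of the user to score
        System.out.println(Arrays.deepToString(userSim));
        System.out.println(Arrays.toString(scoresFor(PREFS, userSim, babbage)));
    }
}
```

Here Babbage shares one article with Turing, so Turing's other article (Comp Sci 2) picks up a non-zero score, while the physics articles read only by dissimilar users score zero.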
[Figure: Mahout's Performance. y-axis: Normalised Amazon Hours; x-axis: No. Good Recommendations/10.]
    Orig. item-based: 6.5K hours, 1.5 good recommendations/10
    Cust. item-based: 2.4K hours, 1.5 good recommendations/10 (-4.1K hours, 63%)
    Orig. user-based: 1K hours, 2.5 good recommendations/10 (-1.4K hours, 58%; +1 good recommendation, 67%)
    Cust. user-based: 0.3K hours, 2.5 good recommendations/10 (-0.7K hours, 70%)
    Overall, Orig. item-based to Cust. user-based: -6.2K hours (95%), +1 good recommendation (67%)
Conclusions
➔ Mahout is doing a great job of powering Mendeley Suggest
    ➔ Large-scale data set
    ➔ Good-quality recommendations
➔ Tuning helps
    ➔ Help Hadoop with task allocation if necessary
    ➔ Partition your data appropriately
    ➔ We save 95% of resources
➔ Use an appropriate algorithm
    ➔ Item- vs user-based (MAHOUT-1004)
    ➔ We increase precision by 67%
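The headline percentages follow directly from the chart values quoted in the slides (normalised Amazon hours in thousands, and good recommendations out of 10). The snippet below is only a worked check of that arithmetic; the class and method names are hypothetical.

```java
/** Worked check of the savings and precision figures quoted in the slides. */
final class SavingsSketch {

    /** Percentage reduction going from 'before' down to 'after'. */
    static double percentReduction(double before, double after) {
        return (before - after) / before * 100.0;
    }

    /** Percentage increase going from 'before' up to 'after'. */
    static double percentIncrease(double before, double after) {
        return (after - before) / before * 100.0;
    }

    public static void main(String[] args) {
        // Orig. item-based (6.5K hours) to Cust. user-based (0.3K hours)
        System.out.printf("resources saved: %.1f%%%n", percentReduction(6.5, 0.3));
        // 1.5 to 2.5 good recommendations out of 10
        System.out.printf("precision gain: %.1f%%%n", percentIncrease(1.5, 2.5));
        // Intermediate steps shown on the chart
        System.out.printf("item-based tuning: %.1f%%%n", percentReduction(6.5, 2.4));
        System.out.printf("item- to user-based: %.1f%%%n", percentReduction(2.4, 1.0));
        System.out.printf("user-based tuning: %.1f%%%n", percentReduction(1.0, 0.3));
    }
}
```

Rounded to whole percentages, these give 95% fewer hours overall and a 67% gain in good recommendations, matching the chart annotations.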

                                   http://www.mendeley.com/profiles/kris-jack/

Contenu connexe

Similaire à Scientific Article Recommendation with Mahout

Mendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureMendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureKris Jack
 
Mendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersMendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersKris Jack
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingJeremy Yang
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and KnowledgeIan Foster
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literaturepetermurrayrust
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of ScienceGlobus
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for ScienceIan Foster
 
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)Liang Gong
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation SequencingEdizonJambormias2
 
BITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS
 
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data WarehouseMaking Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data WarehouseJustin Clark-Casey
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsAnubhav Jain
 
Preserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of ScholarshipPreserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of Scholarshiptsbbbu
 
生態文化ニッチモデリングによる分布推定
生態文化ニッチモデリングによる分布推定生態文化ニッチモデリングによる分布推定
生態文化ニッチモデリングによる分布推定Yasuhisa Kondo
 

Similaire à Scientific Article Recommendation with Mahout (20)

Mendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic LiteratureMendeley: Recommendation Systems for Academic Literature
Mendeley: Recommendation Systems for Academic Literature
 
Mendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchersMendeley, putting data into the hands of researchers
Mendeley, putting data into the hands of researchers
 
Cyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in BiocomputingCyberinfrastructure Day 2010: Applications in Biocomputing
Cyberinfrastructure Day 2010: Applications in Biocomputing
 
Computation and Knowledge
Computation and KnowledgeComputation and Knowledge
Computation and Knowledge
 
Multiscale Modeling
Multiscale ModelingMultiscale Modeling
Multiscale Modeling
 
Can machines understand the scientific literature
Can machines understand the scientific literatureCan machines understand the scientific literature
Can machines understand the scientific literature
 
Foundations for the Future of Science
Foundations for the Future of ScienceFoundations for the Future of Science
Foundations for the Future of Science
 
2015_CV_J_SHELTON_linked
2015_CV_J_SHELTON_linked2015_CV_J_SHELTON_linked
2015_CV_J_SHELTON_linked
 
2013-01-17 Research Object
2013-01-17 Research Object2013-01-17 Research Object
2013-01-17 Research Object
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
 
Learning Systems for Science
Learning Systems for ScienceLearning Systems for Science
Learning Systems for Science
 
Bme451 Fall07 Final
Bme451 Fall07 FinalBme451 Fall07 Final
Bme451 Fall07 Final
 
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)
DLint: dynamically checking bad coding practices in JavaScript (ISSTA'15 Slides)
 
Introduction to Next Generation Sequencing
Introduction to Next Generation SequencingIntroduction to Next Generation Sequencing
Introduction to Next Generation Sequencing
 
BITS: Basics of sequence databases
BITS: Basics of sequence databasesBITS: Basics of sequence databases
BITS: Basics of sequence databases
 
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data WarehouseMaking Linked Data SPARQL with the InterMine Biological Data Warehouse
Making Linked Data SPARQL with the InterMine Biological Data Warehouse
 
2015 illinois-talk
2015 illinois-talk2015 illinois-talk
2015 illinois-talk
 
Open-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data setsOpen-source tools for generating and analyzing large materials data sets
Open-source tools for generating and analyzing large materials data sets
 
Preserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of ScholarshipPreserving the Inputs and Outputs of Scholarship
Preserving the Inputs and Outputs of Scholarship
 
生態文化ニッチモデリングによる分布推定
生態文化ニッチモデリングによる分布推定生態文化ニッチモデリングによる分布推定
生態文化ニッチモデリングによる分布推定
 

Plus de Kris Jack

Modern Perspectives on Recommender Systems and their Applications in Mendeley
Modern Perspectives on Recommender Systems and their Applications in MendeleyModern Perspectives on Recommender Systems and their Applications in Mendeley
Scientific Article Recommendation with Mahout

  • 1. Scientific Article Recommendation with Mahout Kris Jack, PhD Senior Data Mining Engineer
  • 2. Use Case ➔ Good researchers are on top of their game ➔ Large amount of research produced ➔ Takes time to get at what you need ➔ Help researchers by recommending relevant research
  • 3. 1.5 million+ users and 50m research articles. The 20 largest user bases: University of Cambridge, Stanford University, MIT, University of Michigan, Harvard University, University of Oxford, Sao Paulo University, Imperial College London, University of Edinburgh, Cornell University, University of California at Berkeley, RWTH Aachen, Columbia University, Georgia Tech, University of Wisconsin, UC San Diego, University of California at LA, University of Florida, University of North Carolina
  • 4. We need a recommender that scales up, coping with our data and future growth (shown over the user-base list from slide 3)
  • 6. Questions ➔ How does Mahout's recommender work? ➔ How well does it perform out of the box? ➔ How well does it perform after some tuning?
  • 8. Generating recommendations through matrix multiplication. This is item-based recommendation, as similarity is computed between items, not users: org.apache.mahout.cf.taste.hadoop.item.RecommenderJob
  • 9. Input (all user preferences): a matrix of Research Articles (Comp Sci 1, Comp Sci 2, Physics 1, Physics 2) by Researchers (Turing, Babbage, Einstein, Newton)
  • 10. Input (all user preferences) at full scale: 1.5M researchers, 50M research articles, 300M prefs
  • 11. item.RecommenderJob takes All User Preferences (item x user) through three stages: 1. Prep. pref. matrix (1-3) 2. Gen. sim. matrix (4-6) 3. Multiply matrices (7-10)
  • 12-14. item.RecommenderJob (diagram, built up over three slides): Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user), illustrated for the researcher Turing
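The multiplication on slides 12-14 can be sketched in plain Java. This is a minimal illustration, not Mahout's implementation: the class name, the toy preference values, and the use of raw co-occurrence counts as the similarity measure are all assumptions, since the slides do not say which similarity measure was configured.

```java
import java.util.Arrays;

final class ItemBasedSketch {
    // Rows are items, columns are users (item x user), as on the slides.
    // Columns: Turing, Babbage, Einstein, Newton. Values are hypothetical.
    static final int[][] PREFS = {
        {1, 1, 0, 0}, // Comp Sci 1
        {0, 1, 0, 0}, // Comp Sci 2
        {0, 0, 1, 1}, // Physics 1
        {0, 0, 1, 1}, // Physics 2
    };

    /** Item x item co-occurrence counts: S = P * P^T. */
    static int[][] itemSimilarity(int[][] prefs) {
        int items = prefs.length;
        int[][] sim = new int[items][items];
        for (int i = 0; i < items; i++)
            for (int j = 0; j < items; j++)
                for (int u = 0; u < prefs[i].length; u++)
                    sim[i][j] += prefs[i][u] * prefs[j][u];
        return sim;
    }

    /** Scores for one user: similarity matrix times that user's preference column. */
    static int[] scores(int[][] sim, int[][] prefs, int user) {
        int[] out = new int[sim.length];
        for (int i = 0; i < sim.length; i++)
            for (int j = 0; j < sim.length; j++)
                out[i] += sim[i][j] * prefs[j][user];
        return out;
    }

    public static void main(String[] args) {
        int[][] sim = itemSimilarity(PREFS);
        // Scores for Turing (column 0), one entry per article.
        System.out.println(Arrays.toString(scores(sim, PREFS, 0)));
    }
}
```

In this toy data Turing has only read Comp Sci 1, so the non-zero score on Comp Sci 2 is the recommendation; a real job would also filter out articles the user already has.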
  • 15. How well does it work?
  • 17. Running on Amazon's Elastic MapReduce: on-demand use and easy to cost
  • 18-22. Mahout's Performance (chart, built up over five slides): Normalised Amazon Hours (0 to 7K) against No. Good Recommendations/10 (0 to 3), with quadrants Costly & Bad, Costly & Good, Cheap & Bad, Cheap & Good
  • 23. Mahout's Performance (chart): Orig. item-based ➔ 6.5K normalised Amazon hours, 1.5 good recommendations/10
  • 25. 1. Reduce processing time 2. Improve quality
  • 26. 1. Reduce processing time ➔ Mahout's recommender is already efficient ➔ But your data may have unusual properties ➔ Hadoop may need a helping hand ➔ Let's see what's going on...
  • 27. Task Allocation: 37 hours to complete; 1 reducer allocated, despite having 48 available...
  • 28. Task Allocation. Allocating more reducers on a per-job basis: job.getConfiguration().setInt("mapred.reduce.tasks", numReducers); Allocating more mappers on a per-job basis: job.getConfiguration().set("mapred.max.split.size", String.valueOf(splitSize));
  • 29. Task Allocation: from 37 hours to 14 hours to complete, going from 1 → 40 reducers
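Why lowering mapred.max.split.size yields more mappers can be sketched with a little arithmetic. The snippet below assumes the split-size rule used by Hadoop's file-based input formats, max(minSize, min(maxSize, blockSize)), and ignores details such as the slack Hadoop allows on the last split; the class name, file size and block size are hypothetical.

```java
final class SplitSizeSketch {
    /** Assumed split-size rule: max(minSize, min(maxSize, blockSize)). */
    static long splitSize(long minSize, long maxSize, long blockSize) {
        return Math.max(minSize, Math.min(maxSize, blockSize));
    }

    /** Approximate number of map tasks for one input file (ceiling division). */
    static long approxMappers(long fileBytes, long splitBytes) {
        return (fileBytes + splitBytes - 1) / splitBytes;
    }

    public static void main(String[] args) {
        long file = 1L << 30;   // hypothetical 1 GiB input file
        long block = 64L << 20; // hypothetical 64 MiB block size

        long defaultSplit = splitSize(1L, Long.MAX_VALUE, block);
        long smallerSplit = splitSize(1L, 16L << 20, block); // max split size capped at 16 MiB

        System.out.println("default mappers: " + approxMappers(file, defaultSplit));
        System.out.println("capped mappers:  " + approxMappers(file, smallerSplit));
    }
}
```

The reducer count has no such formula: it is whatever the job is told, which is why a job left at a single reducer can run on one node while the rest of the cluster idles.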
  • 30-31. Partitioners: 14 hours to complete, with data unevenly distributed across partitions (~50KB vs ~500MB)
  • 32. InputSampler.Sampler<IntWritable, Text> sampler = new InputSampler.RandomSampler<IntWritable, Text>(...); InputSampler.writePartitionFile(conf, sampler); conf.setPartitionerClass(TotalOrderPartitioner.class); See http://www.philippeadjiman.com/blog/2009/12/20/hadoop-tutorial-series-issue-2-getting-started-with-customized-partitioning/
  • 33. Partitioners: from 14 hours to 2 hours to complete, with data evenly distributed
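The idea behind sampling keys to choose partition boundaries can be illustrated without Hadoop. The sketch below is a plain-Java stand-in, not the InputSampler or TotalOrderPartitioner API: the class name, the skewed key set (squares of 0..999) and the choice of four partitions are assumptions made for the example.

```java
import java.util.Arrays;

final class RangePartitionSketch {
    /** Picks numPartitions - 1 split points at even quantiles of a sorted sample. */
    static int[] splitPoints(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        int[] points = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            points[i - 1] = sorted[(int) ((long) i * sorted.length / numPartitions)];
        }
        return points;
    }

    /** Partition index = number of split points that are <= key. */
    static int partition(int key, int[] points) {
        int p = 0;
        while (p < points.length && key >= points[p]) p++;
        return p;
    }

    /** Counts how many keys land in each partition. */
    static int[] counts(int[] keys, int[] points) {
        int[] c = new int[points.length + 1];
        for (int k : keys) c[partition(k, points)]++;
        return c;
    }

    /** Hypothetical skewed key set: i * i for i in [0, n). */
    static int[] skewedKeys(int n) {
        int[] keys = new int[n];
        for (int i = 0; i < n; i++) keys[i] = i * i;
        return keys;
    }

    public static void main(String[] args) {
        int[] keys = skewedKeys(1000);
        int[] sample = new int[100];
        for (int i = 0; i < sample.length; i++) sample[i] = keys[i * 10]; // every 10th key

        int[] fixedWidth = {250_000, 500_000, 750_000}; // boundaries that ignore the data
        int[] sampled = splitPoints(sample, 4);         // boundaries taken from the sample

        System.out.println("fixed-width: " + Arrays.toString(counts(keys, fixedWidth)));
        System.out.println("sampled:     " + Arrays.toString(counts(keys, sampled)));
    }
}
```

Boundaries chosen without looking at the data leave one partition with far more keys than the others, while boundaries taken from a sample follow the key distribution, which is the effect the move from 14 hours to 2 hours relies on.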
  • 34-36. Mahout's Performance (chart, built up over three slides): Orig. item-based ➔ 6.5K normalised Amazon hours, 1.5 good recommendations/10; Cust. item-based ➔ 2.4K, 1.5; difference -4.1K hours (63%)
  • 37. 2. Improve quality ➔ Mahout provides item-based CF ➔ We have many more items than users ➔ Typically, user-based is more appropriate ➔ So let's make one!
  • 38. item.RecommenderJob (diagram as in slide 14): Item Similarity (item x item) X A User's Preferences (item x user) = Recommendations (item x user)
  • 39. user.RecommenderJob (same diagram, with "item" replaced by "user"): the same three stages, with User Similarity (user x user) taking the place of Item Similarity (item x item)
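A user-based variant of the same multiplication can be sketched as follows. This is an illustration under assumptions, not the code behind MAHOUT-1004 or Mendeley's job: the class name, toy preferences and co-occurrence similarity are hypothetical, and the dense cell counts are only an upper bound because the real matrices are sparse.

```java
import java.util.Arrays;

final class UserBasedSketch {
    // Rows are items, columns are users (item x user).
    // Columns: Turing, Babbage, Einstein, Newton. Values are hypothetical.
    static final int[][] PREFS = {
        {1, 1, 0, 0}, // Comp Sci 1
        {0, 1, 0, 0}, // Comp Sci 2
        {0, 0, 1, 1}, // Physics 1
        {0, 0, 1, 1}, // Physics 2
    };

    /** User x user co-occurrence counts: U = P^T * P. */
    static int[][] userSimilarity(int[][] prefs) {
        int users = prefs[0].length;
        int[][] sim = new int[users][users];
        for (int[] itemRow : prefs)
            for (int u = 0; u < users; u++)
                for (int v = 0; v < users; v++)
                    sim[u][v] += itemRow[u] * itemRow[v];
        return sim;
    }

    /** Scores for one user: preference matrix times that user's similarity column. */
    static int[] scores(int[][] prefs, int[][] userSim, int user) {
        int[] out = new int[prefs.length];
        for (int i = 0; i < prefs.length; i++)
            for (int v = 0; v < userSim.length; v++)
                out[i] += prefs[i][v] * userSim[v][user];
        return out;
    }

    /** Upper bound on the size of a square similarity matrix. */
    static long denseCells(long n) {
        return n * n;
    }

    public static void main(String[] args) {
        int[][] userSim = userSimilarity(PREFS);
        System.out.println(Arrays.toString(scores(PREFS, userSim, 0))); // scores for Turing

        // Figures from the slides: 50M articles, 1.5M users.
        long ratio = denseCells(50_000_000L) / denseCells(1_500_000L);
        System.out.println("item x item is about " + ratio + "x larger than user x user");
    }
}
```

With many more items than users, the user x user similarity matrix is the smaller of the two, which is one reason a user-based job can be cheaper on this kind of data.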
  • 40-42. Mahout's Performance (chart): Orig. user-based ➔ 1K normalised Amazon hours, 2.5 good recommendations/10, compared with Cust. item-based ➔ 2.4K, 1.5; difference -1.4K hours (58%) and +1 good recommendation (67%)
  • 43-44. Mahout's Performance (chart): Cust. user-based ➔ 0.3K, 2.5, compared with Orig. user-based ➔ 1K, 2.5; difference -0.7K hours (70%). For the item-based pair, Orig. ➔ Cust. was -4.1K hours (63%)
  • 45. Mahout's Performance (chart): Orig. item-based ➔ 6.5K, 1.5; Cust. item-based ➔ 2.4K, 1.5; Orig. user-based ➔ 1K, 2.5; Cust. user-based ➔ 0.3K, 2.5. Overall, from Orig. item-based to Cust. user-based: -6.2K hours (95%) and +1 good recommendation (67%)
  • 47. Conclusions ➔ Mahout is doing a great job of powering Mendeley Suggest: large scale data set; good quality recommendations ➔ Tuning helps: help Hadoop with task allocation if necessary; partition your data appropriately; we save 95% resources ➔ Use an appropriate algorithm: item- vs user-based (MAHOUT-1004); we increase precision by 66.6%
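The percentages quoted on the charts and in the conclusions can be checked from the plotted values, assuming each one is a relative change against the earlier configuration (hours in thousands of normalised Amazon hours, quality in good recommendations out of 10):

```latex
\begin{align*}
\text{Orig. item-based} \to \text{Cust. item-based:} \quad & \frac{6.5 - 2.4}{6.5} = \frac{4.1}{6.5} \approx 63\% \text{ fewer hours} \\
\text{Cust. item-based} \to \text{Orig. user-based:} \quad & \frac{2.4 - 1.0}{2.4} = \frac{1.4}{2.4} \approx 58\% \text{ fewer hours} \\
\text{Orig. user-based} \to \text{Cust. user-based:} \quad & \frac{1.0 - 0.3}{1.0} = 70\% \text{ fewer hours} \\
\text{Orig. item-based} \to \text{Cust. user-based:} \quad & \frac{6.5 - 0.3}{6.5} = \frac{6.2}{6.5} \approx 95\% \text{ fewer hours} \\
\text{Precision, item-based} \to \text{user-based:} \quad & \frac{2.5 - 1.5}{1.5} = \frac{1}{1.5} \approx 66.7\% \text{ more good recommendations}
\end{align*}
```

The last line is the figure shown as 67% on the charts and as 66.6% in the conclusions.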
  • 48. Mahout's Performance (chart as in slide 45). http://www.mendeley.com/profiles/kris-jack/