Machine Learning
With Mahout
Rami Mukhtar
Big Data Group
National ICT Australia
February 2012
Mahout: Brief History
•  Started in 2008 as a subproject of Apache Lucene:
        –  Text mining, clustering, some classification.
•  Sean Owen started Taste in 2005:
        –  Recommender engine for business that never took off
        –  Mahout community asked to merge in Taste code
•  Became a top level Apache Project in April 2010
•  Lineage resulted in a fragmented framework.




NICTA Copyright 2011        From imagination to impact       2
Mahout: What is it?
•  Collection of machine learning algorithm implementations:
        –  Many (not all) implemented on Hadoop map-reduce;
        –  Java library with handy command line interface to run common tasks.
•  Currently serves 3 key areas:
        –  Recommendation engines
        –  Clustering
        –  Classification
•  Focus of today’s talk is on functionality accessible from command
   line interface:
        –  Most accessible for Hadoop beginners.




Recommenders
•  Supports user based and item based
   collaborative filtering:
        –  User based: similarity between users;
        –  Item based: similarity between items.

[Figure: users 1 to 6 linked to the items A to F that they like. Annotations: "user 3 likes item E"; "other items user 1 may like".]
Implementations
•  Non-distributed (no Hadoop requirement)
        –  The ‘Taste’ code, supports item and user based;
        –  Good for up to 100 million user-item associations;
        –  Faster than distributed version.
•  Distributed (Hadoop MapReduce)
        –  Item based using similarity measure (configurable)
           between items.
        –  Latent factor based:
               •  Estimates ‘genres’ of items from user preferences
               •  Similar to the entry that won the Netflix Prize.
        –  Both have command line interfaces.


Distributed Item Recommender
[Figure: an m × n user-item ratings matrix (rows User 1 … User m, columns Item 1 … Item n, "R" marks a known rating) feeds a similarity calculation that produces an n × n item similarity matrix.]

Item similarity matrix (example values from the figure):

         1     2     …     n
   1     1    .2     …    .8
   2           1     …    .5
   …                 …    .6
   n                       1
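The similarity calculation on this slide can be sketched in plain Python. This is a minimal, non-distributed illustration, not Mahout's implementation. It assumes cosine similarity (Mahout's measure is configurable), and the `item_similarity` helper and the sample `ratings` are hypothetical.

```python
from collections import defaultdict
from math import sqrt

def item_similarity(ratings):
    """Cosine similarity between the rating columns of every item pair.

    ratings: iterable of (user, item, rating) triples.
    Returns {(item_a, item_b): similarity} with item_a < item_b.
    """
    # Column view of the user-item matrix: item -> {user: rating}
    by_item = defaultdict(dict)
    for user, item, rating in ratings:
        by_item[item][user] = rating

    norms = {i: sqrt(sum(r * r for r in col.values()))
             for i, col in by_item.items()}
    sims = {}
    items = sorted(by_item)
    for idx, a in enumerate(items):
        for b in items[idx + 1:]:
            shared = by_item[a].keys() & by_item[b].keys()
            if not shared or norms[a] == 0 or norms[b] == 0:
                continue  # no user rated both items, or an all-zero column
            dot = sum(by_item[a][u] * by_item[b][u] for u in shared)
            sims[(a, b)] = dot / (norms[a] * norms[b])
    return sims

# Hypothetical (user, item, rating) rows, as in the csv input format.
ratings = [
    (1, "A", 5.0), (1, "B", 3.0),
    (2, "A", 4.0), (2, "B", 2.0), (2, "C", 1.0),
    (3, "B", 4.0), (3, "C", 5.0),
]
sims = item_similarity(ratings)
```

Only the upper triangle is stored, which mirrors the symmetric matrix in the figure.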
Distributed Item Recommendation
Input csv file:   user, item, rating
Output csv file:  item, item, similarity

mahout itemsimilarity -i <input_file> -o <output_path> …

[Figure: user-item ratings matrix in which one user has rated item 2 (R2) and item 5 (R5) but not item 3 (R3?).]

The missing rating is estimated as the similarity-weighted average of that user's known ratings:

    \hat{R}_3 = \frac{s_{2,3} R_2 + s_{3,5} R_5}{s_{2,3} + s_{3,5}}
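The rating estimate on this slide is a similarity-weighted average, which can be sketched as follows. The function name `predict_rating` and the numeric values are hypothetical and chosen only for illustration.

```python
def predict_rating(user_ratings, similarities, target):
    """Similarity-weighted average of one user's known ratings.

    user_ratings: {item: rating} for a single user.
    similarities: {(item, item): similarity}, stored in either key order.
    Returns None when no rated item has a known similarity to the target.
    """
    num = 0.0
    den = 0.0
    for item, rating in user_ratings.items():
        s = similarities.get((item, target), similarities.get((target, item)))
        if s is None:
            continue  # no similarity known between this item and the target
        num += s * rating
        den += s
    return num / den if den else None

# The user has rated items 2 and 5; item 3 is unrated.
user_ratings = {2: 4.0, 5: 2.0}
similarities = {(2, 3): 0.8, (3, 5): 0.2}
predicted = predict_rating(user_ratings, similarities, target=3)
```

With these numbers, item 2 dominates the estimate because it is far more similar to item 3 than item 5 is.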
Distributed Item Recommendation
 •  Can perform item similarity and recommendation generation in a single call:

mahout recommenditembased -i <input_path>
  -o <output_path>
  -u <users_file>
  --numRecommendations …

 •  Input (-i): csv file (tab separated): user, item, rating
 •  Users file (-u): csv file (tab separated), one user per line
 •  Output (-o): csv file (tab separated): user,item,score,item,score,…
 •  --numRecommendations: number of recommendations to return per user
Clustering




[Figure: two-stage pipeline: Vectorization, then Clustering.]
Clustering
•  Don’t know the structure of data, want to sensibly group
   things together.
•  A number of distributed algorithms supported:
        –  Canopy Clustering (MAHOUT-3 – integrated)
        –  K-Means Clustering (MAHOUT-5 – integrated)
        –  Fuzzy K-Means (MAHOUT-74 – integrated)
        –  Expectation Maximization (EM) (MAHOUT-28)
        –  Mean Shift Clustering (MAHOUT-15 – integrated)
        –  Hierarchical Clustering (MAHOUT-19)
        –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
        –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
        –  Spectral Clustering (MAHOUT-363 – integrated)
        –  Minhash Clustering (MAHOUT-344 - integrated)
•  Some have command line interface support.
Vectorization
•  Data specific.
        –  In most cases you need to write a map-reduce job to generate
           vectorized input.
        –  Input formats are still not uniform across Mahout.
        –  Most clustering implementations expect:
               •  SequenceFile(WritableComparable, VectorWritable)
               •  Note: the key is ignored.
•  Mahout has some support for clustering text documents:
        –  Can generate n-gram Term Frequency-Inverse Document
           Frequency (TF-IDF) from a directory of text documents;
        –  Enables text documents to be clustered using command line
           interface.




Text Document Vectorization

Term frequency (within one document):

  Term | The  | conduct | as  | run | doctor | with | a   | Patel
  TF   | 47   | 3       | 7   | 5   | 8      | 12   | 54  | 6

Document frequency (across the corpus):

  Term | The  | conduct | as  | run | doctor | with | a   | Patel
  DF   | 1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3

    \mathrm{TFIDF}_i = \mathrm{TF}_i \cdot \log \frac{N}{\mathrm{DF}_i}

where N is the number of documents in the corpus. This increases the weight of less common words/n-grams within the corpus.

Unigram:  (crude)
Bi-gram:  (crude, oil)
Tri-gram: (crude, oil, prices)
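As a rough illustration of the TF-IDF weighting and n-gram generation on this slide, the sketch below reuses the term and document counts shown there. Two things are assumptions: the corpus size N = 1000 (the slide does not state it) and the use of the natural logarithm. The helper names `tf_idf` and `ngrams` are hypothetical.

```python
from math import log

def tf_idf(term_freq, doc_freq, num_docs):
    """TFIDF_i = TF_i * log(N / DF_i) for every term of one document."""
    return {t: tf * log(num_docs / doc_freq[t]) for t, tf in term_freq.items()}

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

# Counts copied from the slide.
term_freq = {"The": 47, "conduct": 3, "as": 7, "run": 5,
             "doctor": 8, "with": 12, "a": 54, "Patel": 6}
doc_freq = {"The": 1000, "conduct": 198, "as": 999, "run": 567,
            "doctor": 48, "with": 998, "a": 100, "Patel": 3}

# ASSUMPTION: N = 1000 documents; the slide gives no corpus size.
weights = tf_idf(term_freq, doc_freq, num_docs=1000)

bigrams = ngrams(["crude", "oil", "prices"], 2)
trigrams = ngrams(["crude", "oil", "prices"], 3)
```

Under the assumed N, "The" appears in every document and so receives weight zero despite its high term frequency, while "Patel", which appears in only 3 documents, outweighs the more frequent "with".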
Text Document Vectorization

Step 1: directory of plain text documents → sequence file: <name, text body>

mahout seqdirectory -i <input_path> -o <seq_output_path>

Step 2: sequence file → n-gram generation → term frequency and inverse document frequency → TF-IDF vector generation (a dictionary file is also produced)

mahout seq2sparse -i <seq_input_path> -o <output_path>
K-means clustering
Run k-means clustering:

mahout kmeans
  -i <input vectors directory>
  -c <input clusters directory>
  -o <output working directory>
  -k <# clusters sampled from input>
  -dm <DistanceMeasure>
  -x <maximum number of iterations>
  -xm <execution method: seq/mapreduce>
  …

Distance measures (in org.apache.mahout.common.distance):
  CosineDistanceMeasure
  EuclideanDistanceMeasure
  ManhattanDistanceMeasure
  SquaredEuclideanDistanceMeasure
  …

Inspect the result:

mahout clusterdump
  -dt sequencefile
  -d <dictionary_file>
  -s <input_seq_file>

Example output (top terms per cluster):

  Cluster 1                  Cluster 2
  oil        =>  6.20        Coresponsibility  =>  13.97
  barrel     =>  5.15        cereals           =>  13.51
  crude      =>  5.06        penalise          =>  13.25
  prices     =>  4.50        farmers           =>  11.99
  opec       =>  3.23        levies            =>  11.60
  price      =>  2.77        ceilings          =>  11.52
  dlrs       =>  2.76        ec                =>  11.07
  said       =>  2.70        ministers         =>  10.55
  bpd        =>  2.45        output            =>  9.57
  petroleum  =>  1.99        09.73             =>  9.18
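To make the options above more concrete, here is a small single-machine sketch of the k-means iteration (Lloyd's algorithm). It is not Mahout's code. The parameters only loosely mirror the command line: `centroids` plays the role of the initial clusters (-c), `max_iterations` of -x, and `distance` of -dm. The sample points are hypothetical.

```python
def squared_euclidean(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, centroids, max_iterations=10, distance=squared_euclidean):
    """Alternate between assigning points to the nearest centroid
    and moving each centroid to the mean of its assigned points."""
    centroids = [tuple(c) for c in centroids]
    assignment = []
    for _ in range(max_iterations):
        # Assignment step: index of the nearest centroid for each point.
        new_assignment = [
            min(range(len(centroids)), key=lambda k: distance(p, centroids[k]))
            for p in points
        ]
        if new_assignment == assignment:
            break  # converged: no point changed cluster
        assignment = new_assignment
        # Update step: mean of the members of each cluster.
        for k in range(len(centroids)):
            members = [p for p, a in zip(points, assignment) if a == k]
            if members:  # an empty cluster keeps its old centroid
                centroids[k] = tuple(sum(dim) / len(members)
                                     for dim in zip(*members))
    return centroids, assignment

# Hypothetical 2-D vectors forming two obvious groups.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.5),
          (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, assignment = kmeans(points, centroids=[points[0], points[3]])
```

Passing a different `distance` function corresponds to choosing another distance measure. In the text workflow above, the vectors would be TF-IDF vectors rather than 2-D points.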
Classification
•  Train the machine to provide discrete answers to a
   specific question.
•  Mahout supports the following algorithms:
        –  Logistic Regression
        –  Naïve Bayes
        –  Random Forests
        –  Others in development

[Figure: data with known answers (100100: A, 010011: A, 010110: B, 100101: A, 010101: B) feeds a model training algorithm. The trained model then maps data without answers (101100: ?, 010010: ?, 001000: ?, 100000: ?, 001001: ?) to data with estimated answers (101100: A, 000110: B, 011100: A, 101101: A, 010111: B).]
Classification Workflow
1.  Model training: training set → label → vectorize → sample ~90% → model training.
2.  Model testing: the remaining ~10% sample → model testing.
3.  Label approximation: input set → vectorize → trained model → label approximation (A, B, A, A, B, …).
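The 90%/10% sampling and the model testing step of this workflow can be sketched as below. Everything here is illustrative: the helpers `split` and `accuracy`, the synthetic bit-string data, and the stand-in `model` function are assumptions, not part of Mahout.

```python
import random

def split(labelled, train_fraction=0.9, seed=42):
    """Shuffle labelled examples, then split into training and test sets."""
    shuffled = list(labelled)
    random.Random(seed).shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def accuracy(model, test_set):
    """Fraction of held-out examples whose predicted label is correct."""
    correct = sum(1 for features, label in test_set if model(features) == label)
    return correct / len(test_set)

# Hypothetical labelled data: bit-string features with a known label.
labelled = [(format(i, "06b"), "A" if i % 2 else "B") for i in range(100)]
train, test = split(labelled)

def model(bits):
    """Stand-in for a trained model: any callable from features to a label."""
    return "A" if bits[-1] == "1" else "B"

score = accuracy(model, test)
```

Holding out the test sample before training is what makes the accuracy estimate meaningful; evaluating on the training data would overstate performance.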
Feature Extraction
•  Good feature extraction is critical to
   trained model performance:
        – Need domain understanding to ‘measure’ the
          right things.
        – If you measure the wrong things, even the best model
          will perform badly.
        – Caution is needed to avoid ‘label leaks’.
•  Will typically require hand-written map-reduce code:
        – If text based, can use the text mining tools in
          Hive or Mahout.
Naïve Bayes Classifier
    \mathrm{classify}(f_1, f_2, \ldots, f_n) = \operatorname*{argmax}_{l} \; p(L = l) \prod_{i=1}^{n} p(F_i = f_i \mid L = l)

Here (f_1, f_2, …, f_n) is the feature vector and l ranges over the labels.

p(F_i = f_i | L = l) is the probability of feature i having value f_i given l, e.g. assume a Gaussian pdf:

    P(f = v \mid l) = \frac{1}{\sqrt{2 \pi \sigma_l^2}} \, e^{-\frac{(v - \mu_l)^2}{2 \sigma_l^2}}

 Note: Model training boils down to estimating the conditional mean and variance of the
 feature vector elements. This can be trivially parallelized and implemented in
 map reduce.
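A minimal single-machine sketch of the Gaussian naïve Bayes classifier described on this slide follows. It is not Mahout's implementation (Mahout's command line targets text). The function names, the tiny training set and the small variance floor are assumptions. The product of probabilities is computed as a sum of logarithms, which gives the same argmax while avoiding numeric underflow.

```python
from collections import defaultdict
from math import log, pi

def train(examples):
    """Estimate, per label, the prior and each feature's mean and variance."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        variances = [
            sum((v - m) ** 2 for v in col) / n + 1e-9  # floor avoids zero variance
            for col, m in zip(columns, means)
        ]
        model[label] = (n / len(examples), means, variances)
    return model

def classify(model, features):
    """argmax over l of log p(L=l) + sum_i log p(F_i=f_i | L=l)."""
    def log_posterior(label):
        prior, means, variances = model[label]
        score = log(prior)
        for v, m, var in zip(features, means, variances):
            # Log of the Gaussian pdf with mean m and variance var.
            score += -0.5 * log(2 * pi * var) - (v - m) ** 2 / (2 * var)
        return score
    return max(model, key=log_posterior)

# Hypothetical training data: (feature vector, label).
examples = [
    ((1.0, 2.1), "A"), ((1.2, 1.9), "A"), ((0.8, 2.0), "A"),
    ((3.0, 0.5), "B"), ((3.2, 0.7), "B"), ((2.9, 0.4), "B"),
]
model = train(examples)
label = classify(model, (1.1, 2.0))
```

Training needs only per-label sums and counts for each feature, which is why it maps naturally onto map-reduce.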
Naïve Bayes in Mahout
 The command line is specific to text classification (e.g. spam detection, document
 classification, etc.)

 Input: plain text file, format: label \t word word word …
 Output: generated model, a set of files in sequence file format (variances).

mahout trainclassifier -i <input_path> -o <output_path>
  --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …

 •  --gramSize: n-gram size, default = 1.
 •  -minDf: discard n-grams that occur in fewer than this number of documents.
 •  -minSupport: discard n-grams that occur fewer than this number of times in a document.
Naïve Bayes in Mahout
 •  To be practical, you need to write your own classifier.

[Figure: document → Classifier (uses the Trained Model) → label]

Look at class:
org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
• Classify document;
• Return top n predicted labels;
• Return classification certainty;
• …
Classification vs. Recommendation
•  Can use a classifier to recommend:
        – Interested in item or not interested?
•  Classifier is based on features of the
   specific item and the customer
•  Recommendation based on past behavior
   of customers
•  Classification: single decisions
•  Recommendation: ranking



Contenu connexe

Tendances

Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReducePietro Michiardi
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Sparksscdotopen
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)PyData
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReducePietro Michiardi
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Deanna Kosaraju
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingEd Kohlwey
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXrhatr
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce ParadigmDilip Reddy
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formatsVigen Sahakyan
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...DB Tsai
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat DetectionJen Aman
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabVijay Srinivas Agneeswaran, Ph.D
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1Stefanie Zhao
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduceDavid Gleich
 

Tendances (20)

Scalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduceScalable Algorithm Design with MapReduce
Scalable Algorithm Design with MapReduce
 
MapReduce Algorithm Design
MapReduce Algorithm DesignMapReduce Algorithm Design
MapReduce Algorithm Design
 
Map Reduce introduction
Map Reduce introductionMap Reduce introduction
Map Reduce introduction
 
Pig Experience
Pig ExperiencePig Experience
Pig Experience
 
Co-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and SparkCo-occurrence Based Recommendations with Mahout, Scala and Spark
Co-occurrence Based Recommendations with Mahout, Scala and Spark
 
Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)Python in an Evolving Enterprise System (PyData SV 2013)
Python in an Evolving Enterprise System (PyData SV 2013)
 
Relational Algebra and MapReduce
Relational Algebra and MapReduceRelational Algebra and MapReduce
Relational Algebra and MapReduce
 
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
Optimal Execution Of MapReduce Jobs In Cloud - Voices 2015
 
Beyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel ProcessingBeyond Map/Reduce: Getting Creative With Parallel Processing
Beyond Map/Reduce: Getting Creative With Parallel Processing
 
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphXIntroduction into scalable graph analysis with Apache Giraph and Spark GraphX
Introduction into scalable graph analysis with Apache Giraph and Spark GraphX
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
 
Hadoop map reduce concepts
Hadoop map reduce conceptsHadoop map reduce concepts
Hadoop map reduce concepts
 
Map Reduce data types and formats
Map Reduce data types and formatsMap Reduce data types and formats
Map Reduce data types and formats
 
Neo4j vs giraph
Neo4j vs giraphNeo4j vs giraph
Neo4j vs giraph
 
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
2014-10-20 Large-Scale Machine Learning with Apache Spark at Internet of Thin...
 
A Graph-Based Method For Cross-Entity Threat Detection
 A Graph-Based Method For Cross-Entity Threat Detection A Graph-Based Method For Cross-Entity Threat Detection
A Graph-Based Method For Cross-Entity Threat Detection
 
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLabBeyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
Beyond Hadoop 1.0: A Holistic View of Hadoop YARN, Spark and GraphLab
 
BDAS Shark study report 03 v1.1
BDAS Shark study report  03 v1.1BDAS Shark study report  03 v1.1
BDAS Shark study report 03 v1.1
 
Spark at-hackthon8jan2014
Spark at-hackthon8jan2014Spark at-hackthon8jan2014
Spark at-hackthon8jan2014
 
Sparse matrix computations in MapReduce
Sparse matrix computations in MapReduceSparse matrix computations in MapReduce
Sparse matrix computations in MapReduce
 

En vedette

HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSSunil Kakade
 
Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]rillafebrila
 
Business model presentation
Business model presentationBusiness model presentation
Business model presentationMaria Kyamulabye
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsRaghu Pandey
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android appSai Sachin
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureCloudera, Inc.
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer SkillsPhowr Quang
 

En vedette (12)

Principio de la prueba
Principio de la pruebaPrincipio de la prueba
Principio de la prueba
 
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTSHOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
HOW_DATA_CAN_HELP_TO_REDUCE_AVIATION_ACCIDENTS
 
Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]Modul kelompok viii_-_copy_-_copy[1]
Modul kelompok viii_-_copy_-_copy[1]
 
UGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & DetailsUGI Auburn Line Extension Map & Details
UGI Auburn Line Extension Map & Details
 
Business model presentation
Business model presentationBusiness model presentation
Business model presentation
 
Resume-santosh
Resume-santoshResume-santosh
Resume-santosh
 
RBC
RBCRBC
RBC
 
TorchFi platform
TorchFi platformTorchFi platform
TorchFi platform
 
Digital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for SchoolsDigital Citizenship & Internet Maturity for Schools
Digital Citizenship & Internet Maturity for Schools
 
Marketing plan for android app
Marketing plan for android appMarketing plan for android app
Marketing plan for android app
 
Machine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and futureMachine Learning and Hadoop: Present and future
Machine Learning and Hadoop: Present and future
 
UX Designer Skills
UX Designer SkillsUX Designer Skills
UX Designer Skills
 

Similaire à Machine Learning With Mahout: Clustering, Recommendations, and More

Building Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialBuilding Large-scale Real-world Recommender Systems - Recsys2012 tutorial
Building Large-scale Real-world Recommender Systems - Recsys2012 tutorialXavier Amatriain
 
Linked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsLinked Open Data to support content based Recommender Systems
Linked Open Data to support content based Recommender SystemsVito Ostuni
 
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…
Linked Open Data to Support Content-based Recommender Systems - I-SEMANTIC…Roku
 
Hanna bosc2010
Hanna bosc2010Hanna bosc2010
Hanna bosc2010BOSC 2010
 
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習
Azure 機器學習 - 使用Python, R, Spark, CNTK 深度學習 Herman Wu
 
Recommendations play @flipkart (3)
Recommendations play @flipkart (3)Recommendations play @flipkart (3)
Recommendations play @flipkart (3)hava101
 
Big data analytics with R tool.pptx
Big data analytics with R tool.pptxBig data analytics with R tool.pptx
Big data analytics with R tool.pptxsalutiontechnology
 
Study of R Programming
Study of R ProgrammingStudy of R Programming
Study of R ProgrammingIRJET Journal
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Claudio Greco
 
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...
Ask Me Any Rating: A Content-based Recommender System based on Recurrent Neur...Alessandro Suglia
 
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic Systems
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic SystemsDynamic Synthesis of Mediators to Support Interoperability in Autonomic Systems
Dynamic Synthesis of Mediators to Support Interoperability in Autonomic SystemsAmel Bennaceur
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Srinivas Muddana Resume
Srinivas Muddana ResumeSrinivas Muddana Resume
Srinivas Muddana Resumemuddanas
 
Intro to data science module 1 r
Intro to data science module 1 rIntro to data science module 1 r
Intro to data science module 1 ramuletc
 

Similaire à Machine Learning With Mahout: Clustering, Recommendations, and More (20)

Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga Introduction to R software, by Leire ibaibarriaga
Introduction to R software, by Leire ibaibarriaga
 
Machine Learning With Mahout: Clustering, Recommendations, and More

  • 1. Machine Learning With Mahout. Rami Mukhtar, Big Data Group, National ICT Australia, February 2012
  • 2. Mahout: Brief History
      •  Started in 2008 as a subproject of Apache Lucene:
          –  Text mining, clustering, some classification.
      •  Sean Owen started Taste in 2005:
          –  Recommender engine for business that never took off;
          –  Mahout community asked to merge in Taste code.
      •  Became a top-level Apache project in April 2010.
      •  Lineage resulted in a fragmented framework.
      NICTA Copyright 2011 From imagination to impact 2
  • 3. Mahout: What is it?
      •  Collection of machine learning algorithm implementations:
          –  Many (not all) implemented on Hadoop MapReduce;
          –  Java library with a handy command line interface to run common tasks.
      •  Currently serves 3 key areas:
          –  Recommendation engines;
          –  Clustering;
          –  Classification.
      •  Focus of today's talk is on functionality accessible from the command line interface:
          –  Most accessible for Hadoop beginners.
  • 4. Recommenders
      •  Supports user-based and item-based collaborative filtering:
          –  User based: similarity between users;
          –  Item based: similarity between items.
      [Figure: diagram linking users (1 to 6) to items (A to F); labels: "user 3 likes item E", "other items user 1 may like"]
  • 5. Implementations
      •  Non-distributed (no Hadoop requirement):
          –  The 'Taste' code, supports item-based and user-based recommendation;
          –  Good for up to 100 million user-item associations;
          –  Faster than the distributed version.
      •  Distributed (Hadoop MapReduce):
          –  Item based, using a configurable similarity measure between items;
          –  Latent factor based:
              •  Estimates 'genres' of items from user preferences;
              •  Similar to the entry that won the Netflix Prize.
          –  Both have command line interfaces.
  • 6. Distributed Item Recommender
      [Figure: a user-item ratings matrix (users 1 to m by items 1 to n, known ratings marked R) feeds a similarity calculation, which produces an n × n item similarity matrix (example entries: 1, .2, .8, .5, .6)]
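The similarity calculation in the figure above can be sketched in a few lines of plain Python. This is only an illustration, not Mahout's implementation: the slide says the similarity measure is configurable, and cosine similarity is assumed here. The function name `item_similarity` and the toy ratings are hypothetical.

```python
import math

def item_similarity(ratings, n_items):
    """Build an n x n item similarity matrix from sparse user ratings.

    ratings: dict mapping user -> dict mapping item index -> rating.
    Cosine similarity is an assumption; any item-item measure could be used.
    """
    # Turn the user rows into one sparse column per item: user -> rating.
    columns = [{} for _ in range(n_items)]
    for user, row in ratings.items():
        for item, rating in row.items():
            columns[item][user] = rating

    norms = [math.sqrt(sum(r * r for r in col.values())) for col in columns]
    sim = [[0.0] * n_items for _ in range(n_items)]
    for a in range(n_items):
        for b in range(a, n_items):
            # Only users who rated both items contribute to the dot product.
            dot = sum(r * columns[b][u] for u, r in columns[a].items() if u in columns[b])
            denom = norms[a] * norms[b]
            sim[a][b] = sim[b][a] = dot / denom if denom else 0.0
    return sim

# Hypothetical toy data: three users, three items.
ratings = {
    "user1": {0: 5.0, 1: 3.0},
    "user2": {0: 4.0, 2: 2.0},
    "user3": {1: 4.0, 2: 5.0},
}
S = item_similarity(ratings, 3)
for row in S:
    print(["%.2f" % value for value in row])
```

The matrix is symmetric, so only the upper triangle is computed and then mirrored.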
  • 7. Distributed Item Recommendation
      •  Compute item similarities:
             mahout itemsimilarity -i <input_file> -o <output_path> …
          –  Input: csv file with rows user, item, rating;
          –  Output: csv file with rows item, item, similarity.
      •  A missing rating is estimated from the same user's ratings of similar items. Example: the user has rated items 2 and 5 (R2, R5) but not item 3:
             $\hat{R}_3 = \frac{s_{2,3} R_2 + s_{3,5} R_5}{s_{2,3} + s_{3,5}}$
         where s_{i,j} is the similarity between items i and j.
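The weighted-average estimate on this slide can be worked through numerically. The sketch below assumes similarities are stored in a dict keyed by ordered item pairs; the function name `predict_rating` and the numbers are hypothetical and chosen only to mirror the slide's example of items 2, 3 and 5.

```python
def predict_rating(target_item, user_ratings, similarity):
    """Estimate a user's rating of target_item as the similarity-weighted
    average of that user's known ratings, as in the slide's formula."""
    numerator = 0.0
    denominator = 0.0
    for item, rating in user_ratings.items():
        key = (min(target_item, item), max(target_item, item))
        s = similarity.get(key)
        if s is None:
            continue  # no similarity known between the two items
        numerator += s * rating
        denominator += s
    return numerator / denominator if denominator else None

# Hypothetical values: s_{2,3} = 0.8, s_{3,5} = 0.5, R_2 = 4, R_5 = 2.
similarity = {(2, 3): 0.8, (3, 5): 0.5}
user_ratings = {2: 4.0, 5: 2.0}
predicted = predict_rating(3, user_ratings, similarity)
print(predicted)  # (0.8 * 4 + 0.5 * 2) / (0.8 + 0.5)
```

Dividing by the sum of similarities keeps the estimate on the same scale as the input ratings.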
  • 8. Distributed Item Recommendation
      •  Can perform item similarity calculation and recommendation generation in a single call:
             mahout recommenditembased -i <input_path>
                 -o <output_path>
                 -u <users_file>
                 --numRecommendations …
          –  -i: csv file (tab separated): user, item, rating …
          –  -o: csv file (tab separated): user, item, score, item, score, …
          –  -u: csv file (tab separated): user user user
          –  --numRecommendations: number of recommendations to return per user.
  • 9. Clustering
      [Figure: two-stage pipeline, Vectorization followed by Clustering]
  • 10. Clustering
      •  Don't know the structure of the data, want to sensibly group things together.
      •  A number of distributed algorithms are supported:
          –  Canopy Clustering (MAHOUT-3 – integrated)
          –  K-Means Clustering (MAHOUT-5 – integrated)
          –  Fuzzy K-Means (MAHOUT-74 – integrated)
          –  Expectation Maximization (EM) (MAHOUT-28)
          –  Mean Shift Clustering (MAHOUT-15 – integrated)
          –  Hierarchical Clustering (MAHOUT-19)
          –  Dirichlet Process Clustering (MAHOUT-30 – integrated)
          –  Latent Dirichlet Allocation (MAHOUT-123 – integrated)
          –  Spectral Clustering (MAHOUT-363 – integrated)
          –  Minhash Clustering (MAHOUT-344 – integrated)
      •  Some have command line interface support.
  • 11. Vectorization
      •  Data specific:
          –  In the majority of cases you need to write a map-reduce job to generate vectorized input;
          –  Input formats are still not uniform across Mahout;
          –  Most clustering implementations expect:
              •  SequenceFile(WritableComparable, VectorWritable)
              •  Note: the key is ignored.
      •  Mahout has some support for clustering text documents:
          –  Can generate n-gram Term Frequency-Inverse Document Frequency (TF-IDF) vectors from a directory of text documents;
          –  Enables text documents to be clustered using the command line interface.
  • 12. Text Document Vectorization
      •  Term frequency (TF), counted within a document:
             The | conduct | as | run | doctor | with | a  | Patel
             47  | 3       | 7  | 5   | 8      | 12   | 54 | 6
      •  Document frequency (DF), counted across the corpus of N documents:
             The  | conduct | as  | run | doctor | with | a   | Patel
             1000 | 198     | 999 | 567 | 48     | 998  | 100 | 3
      •  TF-IDF weight of term i:
             $\mathrm{TFIDF}_i = \mathrm{TF}_i \cdot \log \frac{N}{\mathrm{DF}_i}$
          –  Increases the weight of less common words/n-grams within the corpus.
      •  N-gram examples:
          –  Unigram: (crude)
          –  Bi-gram: (crude, oil)
          –  Tri-gram: (crude, oil, prices)
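As a worked example, the TF-IDF formula can be applied to the term and document frequencies shown on this slide. The slide does not state the corpus size N; N = 1000 is assumed here only because the largest document frequency shown is 1000. The natural logarithm is also an assumption, and the helper name `tfidf` is hypothetical.

```python
import math

# Values copied from the slide.
terms = ["The", "conduct", "as", "run", "doctor", "with", "a", "Patel"]
tf = [47, 3, 7, 5, 8, 12, 54, 6]
df = [1000, 198, 999, 567, 48, 998, 100, 3]

N = 1000  # ASSUMPTION: corpus size is not given on the slide

def tfidf(tf_i, df_i, n_docs):
    """TFIDF_i = TF_i * log(N / DF_i), using the natural logarithm."""
    return tf_i * math.log(n_docs / df_i)

weights = {t: tfidf(f, d, N) for t, f, d in zip(terms, tf, df)}

# Print terms from highest to lowest weight.
for term, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{term:8s} {weight:8.2f}")
```

Under the assumed N, "The" occurs in every document, so its weight is exactly zero however often it appears in the document at hand.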
  • 13. Text Document Vectorization
      •  Convert a directory of plain text documents into a sequence file of <name, text body> pairs:
             mahout seqdirectory -i <input_path> -o <seq_output_path>
      •  Generate sparse vectors from the sequence file (n-gram generation, dictionary file, term frequency, inverse document frequency, TF-IDF vector generation):
             mahout seq2sparse -i <seq_input_path> -o <output_path>
  • 14. K-means clustering
      •  Run k-means clustering:
             mahout kmeans
                 -i <input vectors directory>
                 -c <input clusters directory>
                 -o <output working directory>
                 -k <# clusters sampled from input>
                 -dm <DistanceMeasure>
                 -x <maximum number of iterations>
                 -xm <execution method: seq/mapreduce>
                 …
          –  Distance measures in org.apache.mahout.common.distance: CosineDistanceMeasure, EuclideanDistanceMeasure, ManhattanDistanceMeasure, SquaredEuclideanDistanceMeasure, …
      •  Inspect the result:
             mahout clusterdump
                 -dt sequencefile
                 -d <dictionary_file>
                 -s <input_seq_file>
      •  Example output (top terms per cluster):
             Cluster 1            | Cluster 2
             oil => 6.20          | Coresponsibility => 13.97
             barrel => 5.15       | cereals => 13.51
             crude => 5.06        | penalise => 13.25
             prices => 4.50       | farmers => 11.99
             opec => 3.23         | levies => 11.60
             price => 2.77        | ceilings => 11.52
             dlrs => 2.76         | ec => 11.07
             said => 2.70         | ministers => 10.55
             bpd => 2.45          | output => 9.57
             petroleum => 1.99    | 09.73 => 9.18
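To make the roles of -k, -dm and -x concrete, here is a minimal sequential k-means sketch in plain Python. It is not Mahout's code: the function names, the Euclidean default and the toy points are assumptions chosen for illustration, loosely mirroring the options listed on the slide.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two equal-length tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def kmeans(points, k, distance=euclidean, max_iterations=10, seed=0):
    """Sequential k-means: k initial centroids are sampled from the input,
    then assignment and centroid update alternate until nothing changes
    or max_iterations is reached."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)
    assignment = None
    for _ in range(max_iterations):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda c: distance(p, centroids[c])) for p in points
        ]
        if new_assignment == assignment:
            break  # converged
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centroids, assignment

# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, assignment = kmeans(points, k=2)
print(centroids)
print(assignment)
```

Passing a different `distance` function plays the same role as choosing a DistanceMeasure class on the command line.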
  • 15. Classification
      •  Train the machine to provide discrete answers to a specific question.
      •  Mahout supports the following algorithms:
          –  Logistic Regression
          –  Naïve Bayes
          –  Random Forests
          –  Others in development
      [Figure: data with known answers (100100: A, 010011: A, 010110: B, 100101: A, 010101: B) feeds a model training algorithm, which produces a trained model; the trained model turns data without answers (101100: ?, 010010: ?, 001000: ?, 100000: ?, 001001: ?) into data with estimated answers (101100: A, 000110: B, 011100: A, 101101: A, 010111: B)]
  • 16. Classification Workflow
      [Figure: three stages. (1) Model training: the labelled input set is vectorized and a ~90% sample is used as the training set. (2) Model testing: the remaining ~10% sample is used to test the model. (3) The trained model is applied to vectorized new input and outputs a label approximation (A, B, A, A, B, …)]
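The ~90% / ~10% split in the workflow can be sketched as follows. This is a generic illustration rather than anything Mahout provides; the helper names `split_labelled` and `accuracy`, the fixed seed and the toy data are all hypothetical.

```python
import random

def split_labelled(examples, train_fraction=0.9, seed=42):
    """Shuffle labelled examples and split them into training and test sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    cut = int(round(len(shuffled) * train_fraction))
    return shuffled[:cut], shuffled[cut:]

def accuracy(classify, test_set):
    """Fraction of held-out examples whose predicted label matches the known label."""
    correct = sum(1 for features, label in test_set if classify(features) == label)
    return correct / len(test_set)

# Hypothetical labelled data: (feature vector, label) pairs.
examples = [((i, i % 2), "A" if i % 2 == 0 else "B") for i in range(100)]
train_set, test_set = split_labelled(examples)

# Stand-in for a trained model: reads the label off the second feature.
def toy_classifier(features):
    return "A" if features[1] == 0 else "B"

print(len(train_set), len(test_set), accuracy(toy_classifier, test_set))
```

Evaluating only on the held-out portion gives an estimate of how the model behaves on data it has not seen during training.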
  • 17. Feature Extraction
      •  Good feature extraction is critical to trained model performance:
          –  Need domain understanding to 'measure' the right things;
          –  Measure the wrong things and even the best model will perform badly;
          –  Caution is needed to avoid 'label leaks'.
      •  Will typically require hand-written map-reduce code:
          –  If text based, can use the text mining tools in Hive or Mahout.
  • 18. Naïve Bayes Classifier
      •  Given a feature vector (f_1, …, f_n), choose the label l that maximizes the prior times the per-feature likelihoods:
             $\mathrm{classify}(f_1, f_2, \ldots, f_n) = \operatorname{argmax}_{l} \; p(L = l) \prod_{i=1}^{n} p(F_i = f_i \mid L = l)$
      •  $p(F_i = f_i \mid L = l)$ is the probability of feature i having value f_i given label l, e.g. assume a Gaussian pdf:
             $P(f = v \mid l) = \frac{1}{\sqrt{2 \pi \sigma_l^2}} \, e^{-\frac{(v - \mu_l)^2}{2 \sigma_l^2}}$
      •  Note: model training boils down to estimating the conditional mean and variance of the feature vector elements. This can be trivially parallelized and implemented in map-reduce.
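The two formulas on this slide translate almost directly into code. The sketch below is a small, non-distributed Gaussian Naïve Bayes in plain Python, not Mahout's implementation. It sums log-probabilities instead of multiplying probabilities, which gives the same argmax while avoiding numerical underflow. The function names, the variance floor and the toy data are assumptions.

```python
import math
from collections import defaultdict

def train(examples):
    """Estimate the prior, per-feature mean and per-feature variance for each label."""
    by_label = defaultdict(list)
    for features, label in examples:
        by_label[label].append(features)
    total = len(examples)
    model = {}
    for label, rows in by_label.items():
        n = len(rows)
        columns = list(zip(*rows))
        means = [sum(col) / n for col in columns]
        # A small floor keeps the variance strictly positive (an assumption).
        variances = [
            max(sum((v - m) ** 2 for v in col) / n, 1e-9)
            for col, m in zip(columns, means)
        ]
        model[label] = (n / total, means, variances)
    return model

def log_gaussian(v, mean, var):
    """Log of the Gaussian pdf from the slide."""
    return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)

def classify(model, features):
    """argmax over labels of log p(L=l) + sum_i log p(F_i=f_i | L=l)."""
    def score(label):
        prior, means, variances = model[label]
        return math.log(prior) + sum(
            log_gaussian(v, m, s) for v, m, s in zip(features, means, variances)
        )
    return max(model, key=score)

# Hypothetical training data: label A clusters near (1, 1), label B near (5, 5).
examples = [
    ((1.0, 1.2), "A"), ((0.8, 1.0), "A"), ((1.2, 0.9), "A"),
    ((5.0, 5.1), "B"), ((4.8, 5.3), "B"), ((5.2, 4.9), "B"),
]
model = train(examples)
print(classify(model, (1.1, 0.9)), classify(model, (4.9, 5.0)))
```

Each label's statistics depend only on that label's rows, which is why the training step parallelizes so easily.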
  • 19. Naïve Bayes in Mahout
      •  Command line is specific to text classification (e.g. spam detection, document classification, etc.):
             mahout trainclassifier -i <input_path> -o <output_path> --gramSize <n_gram_size> -minDf <minimum_DF> -minSupport <min_TF> …
          –  -i: plain text file, format: label <tab> word word word …
          –  -o: generated model, a set of files in sequence file format (variances);
          –  --gramSize: n-gram size, default = 1;
          –  -minDf: discard n-grams that occur in fewer than this number of documents;
          –  -minSupport: discard n-grams that occur fewer than this number of times in a document.
  • 20. Naïve Bayes in Mahout
      •  Need to write your own classifier to be practical:
          –  Classify document;
          –  Return top n predicted labels;
          –  Return classification certainty;
          –  …
      •  Look at class: org.apache.mahout.classifier.bayes.algorithm.BayesAlgorithm
      [Figure: a document is passed to a classifier, which uses the trained model to output a label]
  • 21. Classification vs. Recommendation
      •  Can use a classifier to recommend:
          –  Interested in item or not interested?
      •  Classifier is based on features of the specific item and the customer.
      •  Recommendation is based on past behavior of customers.
      •  Classification: single decisions.
      •  Recommendation: ranking.