Confidential, for personal use only. All original content copyright owned by Treasury of Ideas LLC.
                         Copyright for all other & referenced work is retained by their respective owners.




Essentials of Mahout
Mastering Hadoop Map-reduce for Data Analysis


Shashank Tiwari
blog: shanky.org | twitter: @tshanky
st@treasuryofideas.com




What is Apache Mahout?

• A scalable machine learning infrastructure


• Built on top of Hadoop MapReduce


• Currently supports:


   • Clustering, classification, and collaborative filtering, among others




A Little History

• Founded by folks active in the Lucene community


• Inspired by work at Stanford: “Map-Reduce for Machine Learning on
  Multicore” -- http://www.cs.stanford.edu/people/ang/papers/nips06-mapreducemulticore.pdf




Project Goal

• Create a community-driven, scalable, and robust machine learning
  infrastructure


• Leverage Hadoop for parallel processing and scalability


• Provide an abstraction on top of Hadoop so the machine-learning users are
  not concerned with the map and reduce primitives when they build their
  solutions.




Supported Algorithms

 • Collaborative Filtering


 • User-based and item-based recommenders


 • K-Means, Fuzzy K-Means clustering


 • Mean Shift clustering


 • Dirichlet process clustering




More Supported Algorithms

 • Latent Dirichlet Allocation


 • Singular value decomposition


 • Parallel Frequent Pattern mining


 • Complementary Naive Bayes classifier


 • Random forest (decision-tree-based) classifier


 • ...and growing




Focus Areas

 • Collaborative Filtering


 • Clustering


 • Classification




Build and Install

• Required Software:


   • Java 1.6.x


   • Maven 2.0.11+


• Get source: svn co http://svn.apache.org/repos/asf/mahout/trunk mahout


• Compile & install core & examples: mvn install


   • Alternatively, run the phases individually: mvn compile, mvn package, and mvn install




Recommendation Examples

 • mvn -q exec:java -Dexec.mainClass="org.apache.mahout.cf.taste.example.grouplens.GroupLensRecommenderEvaluatorRunner" -Dexec.args="-i /Users/tshanky/workspace/hadoop_workspace/grouplens/ratings.dat"


 • https://cwiki.apache.org/confluence/display/MAHOUT/RecommendationExamples




Common Use Cases

 • Shopping: Amazon, Netflix


 • Who to follow/friend: Twitter/Facebook


 • Web resource classification, spam filtering, and pattern recognition in
   financial markets




Collaborative Filtering Basis

  • User-based: recommend items by finding similar users. User preferences
    keep changing so this method poses challenges.


  • Item-based: calculate similarity between items and make
    recommendations. Usually items don’t change much so the method is
    often reliable.


  • Slope-one: fast and efficient item-based recommendation when user
    ratings are more than boolean yes/no or like/dislike values.


  • Model-based: provide recommendation on the basis of developing a
    model of users and their ratings.
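
As a rough illustration of the slope-one idea, the sketch below implements weighted Slope One over a tiny in-memory rating set. It is plain Java and does not use Mahout; the class name SlopeOneSketch, the map-based data layout, and the sample ratings are all assumptions made for this example.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

/** Weighted Slope One on in-memory ratings (illustrative sketch, not Mahout code). */
final class SlopeOneSketch {
    // diffs.get(j).get(i): sum of (rating of j - rating of i) over users who rated both
    private final Map<String, Map<String, Double>> diffs = new HashMap<>();
    // counts.get(j).get(i): number of users who rated both j and i
    private final Map<String, Map<String, Integer>> counts = new HashMap<>();

    SlopeOneSketch(Map<String, Map<String, Double>> ratingsByUser) {
        for (Map<String, Double> ratings : ratingsByUser.values()) {
            for (Map.Entry<String, Double> a : ratings.entrySet()) {
                for (Map.Entry<String, Double> b : ratings.entrySet()) {
                    if (a.getKey().equals(b.getKey())) {
                        continue; // an item is never compared with itself
                    }
                    diffs.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                         .merge(b.getKey(), a.getValue() - b.getValue(), Double::sum);
                    counts.computeIfAbsent(a.getKey(), k -> new HashMap<>())
                          .merge(b.getKey(), 1, Integer::sum);
                }
            }
        }
    }

    /** Predicts the user's rating of target, or NaN when no co-rated item exists. */
    double predict(Map<String, Double> userRatings, String target) {
        Map<String, Integer> targetCounts =
                counts.getOrDefault(target, Collections.<String, Integer>emptyMap());
        double numerator = 0.0;
        int denominator = 0;
        for (Map.Entry<String, Double> rated : userRatings.entrySet()) {
            Integer n = targetCounts.get(rated.getKey());
            if (n == null) {
                continue; // nobody rated both this item and the target
            }
            double averageDiff = diffs.get(target).get(rated.getKey()) / n;
            numerator += (averageDiff + rated.getValue()) * n; // weight by support
            denominator += n;
        }
        return denominator == 0 ? Double.NaN : numerator / denominator;
    }

    public static void main(String[] args) {
        Map<String, Map<String, Double>> data = new HashMap<>();
        Map<String, Double> john = new HashMap<>();
        john.put("A", 5.0); john.put("B", 3.0); john.put("C", 2.0);
        Map<String, Double> mark = new HashMap<>();
        mark.put("A", 3.0); mark.put("B", 4.0);
        Map<String, Double> lucy = new HashMap<>();
        lucy.put("B", 2.0); lucy.put("C", 5.0);
        data.put("john", john); data.put("mark", mark); data.put("lucy", lucy);
        System.out.println(new SlopeOneSketch(data).predict(lucy, "A"));
    }
}
```

The prediction only needs average rating differences between item pairs, which is why the method is cheap to compute but requires graded ratings rather than yes/no preferences.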




Clustering Basis

 • Clustering algorithms also use the notion of similarity to group similar
   items into a cluster.


 • Both collaborative filtering and clustering use the notion of a distance,
   which can be calculated using a number of different techniques.


    • Example: Euclidean distance,
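
To make the distance notion concrete, here is a minimal sketch of Euclidean distance, d(a, b) = sqrt(sum_i (a_i - b_i)^2), between two equal-length vectors. The class DistanceSketch and the distance-to-similarity mapping 1 / (1 + d) are assumptions for illustration, not Mahout's own classes.

```java
/** Euclidean distance between feature vectors (illustrative sketch, not Mahout code). */
final class DistanceSketch {

    /** Returns the square root of the summed squared differences, component by component. */
    static double euclidean(double[] a, double[] b) {
        if (a.length != b.length) {
            throw new IllegalArgumentException("vectors must have equal length");
        }
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** One common way to turn a distance in [0, inf) into a similarity in (0, 1]. */
    static double similarity(double[] a, double[] b) {
        return 1.0 / (1.0 + euclidean(a, b));
    }

    public static void main(String[] args) {
        double[] x = {0.0, 0.0};
        double[] y = {3.0, 4.0};
        System.out.println(euclidean(x, y));
        System.out.println(similarity(x, y));
    }
}
```

A smaller distance means the two users or items are more alike, so a recommender or clustering algorithm can rank candidates by this value.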




Mahout Taste Framework

• Taste Collaborative Filtering:


   • Taste is an open source project for CF started by Sean Owen on
     SourceForge and donated to Mahout in 2008.


   • Has been applied to a number of different data sets successfully.


• Mahout's support for building recommendation engines is based primarily on
  the Taste library.


   • The library supports both user-based and item-based recommendations.


• Can be used with Java or over RESTful web-service endpoints.




Taste Framework : Primary Classes

 • DataModel: Model for Users, Items, and Preferences


 • UserSimilarity: Interface defining the similarity between two users


 • ItemSimilarity: Interface defining the similarity between two items


 • Recommender: Interface for providing recommendations


 • UserNeighborhood: Interface for computing a neighborhood of similar
   users. These are used by the Recommenders.




Taste Framework : Online vs Offline

 • Can do online recommendations for a few thousand data sets.


 • Leverages Hadoop for offline recommendation calculations on large data
   sets.




Understanding the Group Lens Implementation

• Provides insight into a sample Mahout Taste Framework implementation.


• Uses the publicly available GroupLens data set


• Part of the distribution so you can analyze it, modify it, and use it as an
  inspiration for your own implementation


• Easy-to-follow example




Group Lens Implementation Source

• GroupLensDataModel.java


• GroupLensRecommender.java


• GroupLensRecommenderBuilder.java


• GroupLensRecommenderEvaluatorRunner.java




Group Lens Runner -- evaluator

• Instantiates an evaluator:


   • RecommenderEvaluator evaluator = new AverageAbsoluteDifferenceRecommenderEvaluator();


   • Scores the recommender by its mean absolute error: the average absolute
     difference between predicted and actual ratings


• Parses input parameters:


   • File ratingsFile = TasteOptionParser.getRatings(args);
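
The score produced by an average-absolute-difference evaluator can be pictured with the stand-alone sketch below. It is plain Java rather than Mahout's evaluator, and the class name MaeSketch and the sample ratings are assumptions for illustration.

```java
/** Mean absolute error between predicted and actual ratings (illustrative sketch). */
final class MaeSketch {

    /** Averages |predicted - actual| over all held-out ratings; lower is better. */
    static double meanAbsoluteError(double[] predicted, double[] actual) {
        if (predicted.length != actual.length || predicted.length == 0) {
            throw new IllegalArgumentException("need two non-empty arrays of equal length");
        }
        double total = 0.0;
        for (int i = 0; i < predicted.length; i++) {
            total += Math.abs(predicted[i] - actual[i]);
        }
        return total / predicted.length;
    }

    public static void main(String[] args) {
        double[] predicted = {4.0, 3.0, 5.0}; // hypothetical estimates
        double[] actual = {5.0, 3.0, 3.0};    // hypothetical held-out ratings
        System.out.println(meanAbsoluteError(predicted, actual));
    }
}
```

A score of 0 would mean every held-out rating was predicted exactly; on a 1-to-5 rating scale, a score of 1.0 means predictions are off by one star on average.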




Group Lens Runner -- data model

 • Parses a file whose fields are separated by a double-colon (::) delimiter:


    • DataModel model = ratingsFile == null ? new GroupLensDataModel() :
      new GroupLensDataModel(ratingsFile);
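
The slides do not show how GroupLensDataModel reads its input, so the following is only a stand-alone sketch of parsing one line in the UserID::MovieID::Rating::Timestamp layout used by the GroupLens ratings.dat file. The class RatingLineSketch, its fields, and the sample line are assumptions for illustration.

```java
/** One parsed line of a "::"-delimited ratings file (illustrative sketch, not Mahout code). */
final class RatingLineSketch {
    final long userId;
    final long itemId;
    final float rating;

    private RatingLineSketch(long userId, long itemId, float rating) {
        this.userId = userId;
        this.itemId = itemId;
        this.rating = rating;
    }

    /** Splits on the literal "::" and keeps the first three fields; the timestamp is ignored. */
    static RatingLineSketch parse(String line) {
        String[] fields = line.trim().split("::");
        if (fields.length < 3) {
            throw new IllegalArgumentException("expected user::item::rating, got: " + line);
        }
        return new RatingLineSketch(
                Long.parseLong(fields[0]),
                Long.parseLong(fields[1]),
                Float.parseFloat(fields[2]));
    }

    public static void main(String[] args) {
        RatingLineSketch r = parse("1::1193::5::978300760"); // sample line for illustration
        System.out.println(r.userId + " rated " + r.itemId + " with " + r.rating);
    }
}
```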




Group Lens Runner -- evaluate with recommendation builder

• Evaluates using GroupLensRecommender:


  • double evaluation = evaluator.evaluate(new
    GroupLensRecommenderBuilder(), null, model, 0.9, 0.3);


  • 0.9 is the training percentage (the share of each user's preferences used
    to build the recommender); 0.3 is the evaluation percentage (the share of
    users included in the evaluation)




Questions?




• blog: shanky.org | twitter: @tshanky


• st@treasuryofideas.com
