Learning with Mahout
   Drew Farris
   Committer to Apache Mahout since 2/2010
     ..not as active in the past year 

     Author: Taming Text
     My Company: (and BarCamp DC Sponsor)
   Mahout (as in hoot) or Mahout (as in trout)?
   A scalable machine learning library
     ‘large’ data sets
     Often Hadoop
     ..but sometimes not
   A scalable machine learning library
     Recommendation Mining
     Clustering
     Classification
     Association Mining
     A reasonable linear algebra library
     A reasonable library of collections
     Other Stuff
   Getting Started
     Check out & build the code
      ▪ git clone git://git.apache.org/mahout.git
      ▪ mvn install -DskipTests=true
      ▪ The tests take a looong time to run and are not needed for the initial build
     Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
   Getting Started
     Check out & build the code
     Examples in examples/bin
     Wiki (http://mahout.apache.org/)
     Articles & Presentations
      ▪ Grant’s IBM Developerworks Article
        ▪ http://ibm.co/LUbptg (Nov 2011)
      ▪ Others @ http://bit.ly/IZ6PqE (wiki)
   Getting Started
       Check out & build the code
       Examples in examples/bin
       Wiki (http://mahout.apache.org/)
       Articles & Publications (http://bit.ly/IZ6PqE)
       Mailing Lists
        ▪   user-subscribe@mahout.apache.org
        ▪   (http://bit.ly/L1GSHB)
        ▪   dev-subscribe@mahout.apache.org
        ▪   (http://bit.ly/JPeNoE)
   Getting Started
     Check out & build the code
     Examples in examples/bin
     Wiki (http://mahout.apache.org/)
     Articles & Presentations
     Mailing Lists
     Books!
      ▪ Mahout in Action: http://bit.ly/IWMvaz
      ▪ Taming Text: http://bit.ly/KkODZV
   Kicking the Tires in examples/bin
     classify-20newsgroups.sh
     cluster-reuters.sh
     cluster-syntheticcontrol.sh
     asf-email-examples.sh
   Kicking the Tires in examples/bin
     classify-20newsgroups.sh
     Premise: Classify News Stories
     Algorithm: sgd
       Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
   Kicking the Tires in examples/bin
     cluster-reuters.sh
     Premise: Group Related News Stories
       Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
   Kicking the Tires in examples/bin
     cluster-syntheticcontrol.sh
        ▪ Premise: Cluster time series data
            ▪ normal, cyclic, increasing, decreasing, upward, downward shift
        ▪ Algorithms:
            ▪ canopy, kmeans, fuzzykmeans, dirichlet, meanshift


       See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html
       Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
   Kicking the Tires in examples/bin
     asf-email-examples.sh
      ▪ Recommendation (user based)
      ▪ Clustering (kmeans, dirichlet, minhash)
      ▪ Classification (naïve bayes, sgd)
   General Outline:
     Data Transformation
        ▪ From Native format to…
        ▪ ..Sequence Files; Typed Key, Value pairs
        ▪ ..Labeled Vectors
       Model Training
       Model Evaluation
       Lather, Rinse, Repeat
       Production
       Lather, Rinse, Repeat
   mahout seq2sparse
     Tokenize Documents
     Count Words
     Make Partial/Merge Vectors
     TFIDF
     Make Partial/Merge TFIDF Vectors
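
The seq2sparse steps above (tokenize, count words, weight by TF-IDF) can be sketched in a few lines of plain Python. This is an illustration only: it uses the textbook weighting tf * log(N / df) and a whitespace tokenizer, both of which are assumptions here, and Mahout's own tokenization and weighting may differ in detail. The names `tokenize` and `tfidf_vectors` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lower-case and split on whitespace (a stand-in for a real analyzer)."""
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one {term: weight} dict per document.

    Textbook weighting: tf * log(N / df). This is an assumption made for
    illustration, not necessarily the formula Mahout applies.
    """
    tokenized = [tokenize(d) for d in docs]
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term counts for this document
        vectors.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return vectors


# Toy corpus, invented for this example.
docs = ["mahout runs on hadoop", "hadoop stores big data", "mahout clusters data"]
vectors = tfidf_vectors(docs)
```

A term that occurs in only one document (such as "runs") gets the highest weight, while a term that occurs in every document would get a weight of zero.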
   View Sequence Files with:
       mahout seqdumper -i /path/to/sequence/file

   Check out shortcuts in:
       src/conf/driver.classes.props


   Run classes with:
       mahout org.apache.mahout.SomeCoolNewFeature …

   Standalone vs. Distributed
     Standalone mode is default
     Set HADOOP_CONF_DIR to use Hadoop
     MAHOUT_LOCAL will force standalone
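
The three bullets above amount to a small decision rule. The sketch below is a simplified reading of the slide, not a copy of the launcher script, which may consult other variables as well. The function name `pick_mode` is hypothetical.

```python
def pick_mode(env):
    """Return 'standalone' or 'distributed' for a dict of environment variables.

    Simplified reading of the slide; the real launcher may check more.
    """
    if env.get("MAHOUT_LOCAL"):      # any non-empty value forces standalone
        return "standalone"
    if env.get("HADOOP_CONF_DIR"):   # a Hadoop configuration is available
        return "distributed"
    return "standalone"              # the default


# Hypothetical environments for illustration.
print(pick_mode({}))
print(pick_mode({"HADOOP_CONF_DIR": "/etc/hadoop/conf"}))
print(pick_mode({"HADOOP_CONF_DIR": "/etc/hadoop/conf", "MAHOUT_LOCAL": "true"}))
```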
   asf-email-examples.sh (recommendation)
   Premise: Recommend Interesting Threads
   User based recommendation
   Boolean preferences based on thread contribution
     Implies boolean similarity measure – tanimoto, log-likelihood




   See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
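
With boolean preferences, each user is just the set of threads they contributed to, so similarity has to be computed on sets. A minimal sketch of the Tanimoto coefficient (intersection size over union size) follows; the log-likelihood measure is not shown. The function name and the thread ids are invented for this example.

```python
def tanimoto(a, b):
    """Tanimoto coefficient of two sets: |a & b| / |a | b| (0.0 if both are empty)."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Two hypothetical users and the threads each one posted in.
alice = {"t1", "t2", "t3"}
bob = {"t2", "t3", "t4"}
print(tanimoto(alice, bob))  # 2 shared threads out of 4 distinct ones
```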
   Recommendation Steps
     Convert Mail to Sequence Files
     Convert Sequence Files to Preferences
     Prepare Preference Matrix
     Row Similarity Job
     Recommender Job




   See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
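
To make the premise concrete, here is a small in-memory sketch of user-based recommendation over boolean preferences. It is not how the distributed jobs listed above are structured; it only shows the idea of scoring unseen threads by the similarity of the users who contributed to them. All names and data are hypothetical.

```python
def tanimoto(a, b):
    """Set overlap: |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def recommend(prefs, user, top_n=3):
    """Rank threads the user has not joined by the summed similarity of users who did."""
    seen = prefs[user]
    scores = {}
    for other, threads in prefs.items():
        if other == user:
            continue
        sim = tanimoto(seen, threads)
        if sim == 0.0:
            continue  # users with nothing in common contribute nothing
        for thread in threads - seen:
            scores[thread] = scores.get(thread, 0.0) + sim
    # Highest score first; ties broken by thread id for a stable order.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [thread for thread, _ in ranked[:top_n]]


# Toy preference data: user -> set of threads they posted in.
prefs = {
    "alice": {"t1", "t2"},
    "bob": {"t1", "t2", "t3"},
    "carol": {"t2", "t4"},
    "dave": {"t5"},
}
print(recommend(prefs, "alice"))
```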
   asf-email-examples.sh (classification)
   Premise: Predict project mailing lists for incoming messages
   Data labeled based on the mailing list it arrived on
    Hold back a random 20% of the data for testing; use the rest for training.
    Algorithms: Naïve Bayes (Standard, Complementary), SGD



   See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
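
The hold-back step can be sketched as a seeded shuffle followed by a cut. This is an illustration in plain Python, not the splitting tool the example script uses; `split_holdout`, the seed, and the message ids are assumptions made for the example.

```python
import random


def split_holdout(items, test_fraction=0.2, seed=42):
    """Shuffle a copy of items and hold back test_fraction of them for testing."""
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    shuffled = list(items)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]  # (train, test)


# Ten hypothetical message ids.
messages = [f"msg-{i}" for i in range(10)]
train, test = split_holdout(messages)
print(len(train), len(test))
```

Splitting at random, rather than taking the last 20%, keeps the test set from being dominated by whichever mailing list happens to come last in the input.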
   Classification Steps
     Convert Mail to Sequence Files
     Sequence Files to Sparse Vectors
     Modify Sequence File Labels
     Split into Training and Test Sets
     Train the Model
     Test the Model


   See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
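
The train and test steps can be illustrated with a tiny multinomial Naïve Bayes classifier using add-one smoothing. This is a from-scratch sketch of the standard variant only, not Mahout's implementation and not the Complementary variant; the mailing-list labels and tokens are toy data invented for the example.

```python
import math
from collections import Counter, defaultdict


def train_nb(labeled_docs):
    """labeled_docs: iterable of (label, tokens). Returns a model dict."""
    label_counts = Counter()
    term_counts = defaultdict(Counter)
    vocab = set()
    for label, tokens in labeled_docs:
        label_counts[label] += 1
        term_counts[label].update(tokens)
        vocab.update(tokens)
    return {"labels": label_counts, "terms": term_counts, "vocab": vocab}


def classify(model, tokens):
    """Return the label with the highest log-probability (add-one smoothing)."""
    total_docs = sum(model["labels"].values())
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["labels"].items():
        counts = model["terms"][label]
        denom = sum(counts.values()) + vocab_size
        score = math.log(n_docs / total_docs)  # log prior
        for t in tokens:
            score += math.log((counts[t] + 1) / denom)  # smoothed log likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data: (mailing list label, tokens of the message).
train = [
    ("mahout-user", ["kmeans", "cluster", "vectors"]),
    ("mahout-user", ["recommender", "vectors"]),
    ("lucene-user", ["index", "query", "analyzer"]),
    ("lucene-user", ["query", "score"]),
]
test = [("mahout-user", ["cluster", "vectors"]), ("lucene-user", ["query", "index"])]

model = train_nb(train)
correct = sum(classify(model, tokens) == label for label, tokens in test)
accuracy = correct / len(test)
print(accuracy)
```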
   asf-email-examples.sh (clustering)
   Premise: Grouping Messages by Subject
   Same Prep as Classification
   Different Algorithms: (kmeans, dirichlet, minhash)


     12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398
      ms (Minutes: 342.95663333333334

   See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
   Clustering Steps
     Convert Mail to Sequence Files
     Sequence Files to Sparse Vectors
     Run Clustering (iterate)
     Dump Results
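
The "Run Clustering (iterate)" step is easiest to see with k-means. Below is a minimal single-machine sketch of Lloyd's algorithm on small tuples; it is not Mahout's distributed implementation, and the function name, points, and starting centroids are assumptions chosen for the example.

```python
def kmeans(points, centroids, iterations=10):
    """Plain Lloyd's algorithm on tuples of numbers; returns (centroids, assignments)."""
    assignments = []
    for _ in range(iterations):
        # Assignment step: index of the nearest centroid for each point.
        assignments = [
            min(range(len(centroids)),
                key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centroids[c])))
            for point in points
        ]
        # Update step: move each centroid to the mean of its points.
        new_centroids = []
        for c, old in enumerate(centroids):
            members = [pt for pt, a in zip(points, assignments) if a == c]
            if members:
                new_centroids.append(tuple(sum(dim) / len(members) for dim in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster where it was
        if new_centroids == centroids:
            break  # converged: nothing moved
        centroids = new_centroids
    return centroids, assignments


# Two obvious groups of 2-D points and one starting centroid in each.
points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, assignments = kmeans(points, [(0, 0), (10, 10)])
print(centroids, assignments)
```

Each pass over the data corresponds to one iteration of the clustering job, which is why clustering runs on large mail archives can take hours.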
   Insert Bar Camp Style Discussion Here
   Mahout in Action
     Owen, Anil, Dunning and Friedman
     http://bit.ly/IWMvaz


   Taming Text
     Ingersoll, Morton and Farris
     http://bit.ly/KkODZV

Mahout Introduction BarCampDC

  • 2. Drew Farris  Committer to Apache Mahout since 2/2010  ..not as active in the past year   Author: Taming Text  My Company: (and BarCamp DC Sponsor)
  • 3. Mahout (as in hoot) or Mahout (as in trout)?  A scalable machine learning library
  • 4. A scalable machine learning library  ‘large’ data sets  Often Hadoop  ..but sometimes not
  • 5. A scalable machine learning library  Recommendation Mining
  • 6. A scalable machine learning library  Recommendation Mining  Clustering
  • 7. A scalable machine learning library  Recommendation Mining  Clustering  Classification
  • 8. A scalable machine learning library  Recommendation Mining  Clustering  Classification  Association Mining
  • 9. A scalable machine learning library  Recommendation Mining  Clustering  Classification  Association Mining  A reasonable linear algebra library  A reasonable library of collections
  • 10. A scalable machine learning library  Recommendation Mining  Clustering  Classification  Association Mining  A reasonable linear algebra library  A reasonable library of collections  Other Stuff
  • 11. Getting Started  Check out & build the code ▪ git clone git://git.apache.org/mahout.git ▪ mvn install –DskipTests=true ▪ The tests take a looong time to run, not needed for intial build  Or use the Cloudera Virtual Machine (http://bit.ly/MyBnFi)
  • 12. Getting Started  Check out & build the code  Examples in examples/bin
  • 13. Getting Started  Check out & build the code  Examples in examples/bin  Wiki (http://mahout.apache.org/)
  • 14. Getting Started  Check out & build the code  Examples in examples/bin  Wiki (http://mahout.apache.org/)  Articles & Presentations ▪ Grant’s IBM Developerworks Article ▪ http://ibm.co/LUbptg (Nov 2011) ▪ Others @ http://bit.ly/IZ6PqE (wiki)
  • 15. Getting Started  Check out & build the code  Examples in examples/bin  Wiki (http://mahout.apache.org/)  Articles & Publications (http://bit.ly/IZ6PqE)  Mailing Lists ▪ user-subscribe@mahout.apache.org ▪ (http://bit.ly/L1GSHB) ▪ dev-subscribe@mahout.apache.org ▪ (http://bit.ly/JPeNoE)
  • 16. Getting Started  Check out & build the code  Examples in examples/bin  Wiki (http://mahout.apache.org/)  Articles & Presentations  Mailing Lists  Books! ▪ Mahout in Action: http://bit.ly/IWMvaz ▪ Taming Text: http://bit.ly/KkODZV
  • 17. Kicking the Tires in examples/bin  classify-20newsgroups.sh  cluster-reuters.sh  cluster-syntheticcontrol.sh  asf-email-examples.sh
  • 18. Kicking the Tires in examples/bin  classify-20newsgroups.sh  Premise: Classify News Stories  Algorithm: sgd  Data: http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz
  • 19. Kicking the Tires in examples/bin  cluster-reuters.sh  Premise: Group Related News Stories  Data: http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.tar.gz
  • 20. Kicking the Tires in examples/bin  cluster-syntheticcontrol.sh ▪ Premise: Cluster time series data ▪ normal, cyclic, increasing, decreasing, upward, downward shift ▪ Algorithms: ▪ canopy, kmeans, fuzzykmeans, dirichlet, meanshift  See: https://cwiki.apache.org/MAHOUT/clustering-of-synthetic-control-data.html  Data: http://archive.ics.uci.edu/ml/databases/synthetic_control/synthetic_control.data.html
  • 21. Kicking the Tires in examples/bin  asf-email-examples.sh ▪ Recommendation (user based) ▪ Clustering (kmeans, dirichlet, minhash) ▪ Classification (naïve bayes, sgd)
  • 22. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors
  • 23. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors  Model Training
  • 24. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors  Model Training  Model Evaluation
  • 25. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors  Model Training  Model Evaluation  Lather, Rinse, Repeat
  • 26. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors  Model Training  Model Evaluation  Lather, Rinse, Repeat  Production
  • 27. General Outline:  Data Transformation ▪ From Native format to… ▪ ..Sequence Files; Typed Key, Value pairs ▪ ..Labeled Vectors  Model Training  Model Evaluation  Lather, Rinse, Repeat  Production  Lather, Rinse, Repeat
  • 28. mahout seq2sparse  Tokenize Documents  Count Words  Make Partial/Merge Vectors  TFIDF  Make Partial/Merge TFIDF Vectors
  • 29. View Sequence Files with:  mahout seqdumper -i /path/to/sequence/file  Check out shortcuts in:  src/conf/driver.classes.props  Run classes with:  mahout org.apache.mahout.SomeCoolNewFeature …  Standalone vs. Distributed  Standalone mode is default  Set HADOOP_CONF_DIR to use Hadoop  MAHOUT_LOCAL will force standalone
  • 30. asf-email-examples.sh (recommendation)  Premise: Recommend Interesting Threads  User based recommendation  Boolean preferences based on thread contribution  Implies boolean similarity measure – tanimoto, log-likelihood  See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
  • 31. Recommendation Steps  Convert Mail to Sequence Files  Convert Sequence Files to Preferences  Prepare Preference Matrix  Row Similarity Job  Recommender Job  See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
  • 32. asf-email-examples.sh (classification)  Premise: Predict project mailing lists for incoming messages  Data labeled based on the mailing list it arrived on  Hold back a random 20% of data for testing, the rest for training.  Algorithms: Naïve Bayes (Standard, Complementary), SGD  See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
  • 33. Classification Steps  Convert Mail to Sequence Files  Sequence Files to Sparse Vectors  Modify Sequence File Labels  Split into Training and Test Sets  Train the Model  Test the Model  See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
  • 34. asf-email-examples.sh (clustering)  Premise: Grouping Messages by Subject  Same Prep as Classification  Different Algorithms: (kmeans, dirichlet, minhash)  12/05/16 05:16:02 INFO driver.MahoutDriver: Program took 20577398 ms (Minutes: 342.95663333333334)  See: http://www.ibm.com/developerworks/java/library/j-mahout-scaling/
  • 35. Clustering Steps  Convert Mail to Sequence Files  Sequence Files to Sparse Vectors  Run Clustering (iterate)  Dump Results
  • 36. Insert Bar Camp Style Discussion Here
  • 37. Mahout in Action  Owen, Anil, Dunning and Friedman  http://bit.ly/IWMvaz  Taming Text  Ingersoll, Morton and Farris  http://bit.ly/KkODZV
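
Slide 28 lists the stages of mahout seq2sparse: tokenize, count words, then weight with TFIDF. The sketch below walks through the same stages on three tiny in-memory documents. It is an illustration only, not Mahout's code: the whitespace tokenizer, the function names, and the textbook weight tf * log(N / df) are all assumptions, and Mahout's actual weighting and its distributed partial/merge vector steps differ.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumed tokenizer: lowercase and split on whitespace.
    return text.lower().split()


def tfidf_vectors(docs):
    """docs maps doc_id -> text; returns doc_id -> {term: weight}."""
    # Stage 1 and 2: tokenize each document and count its words.
    term_counts = {doc_id: Counter(tokenize(text)) for doc_id, text in docs.items()}

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for counts in term_counts.values():
        doc_freq.update(counts.keys())

    # Stage 3: weight each term count by log(N / df) (textbook form, assumed).
    n_docs = len(docs)
    return {
        doc_id: {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        for doc_id, counts in term_counts.items()
    }


docs = {
    "a": "mahout clusters text",
    "b": "mahout classifies text",
    "c": "hadoop stores data",
}
vectors = tfidf_vectors(docs)
```

Terms that appear in only one document ("clusters") end up with a higher weight than terms shared across documents ("mahout"), which is the point of the TFIDF stage.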

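Slide 30 notes that boolean preferences (a user either contributed to a thread or did not) imply a boolean similarity measure such as Tanimoto. A minimal sketch of that coefficient follows, assuming each user is represented simply as the set of thread IDs they posted to. The user names, thread IDs, and helper names are hypothetical, and this is not Mahout's recommender API.

```python
def tanimoto(a, b):
    """Tanimoto (Jaccard) coefficient: |A intersect B| / |A union B|."""
    a, b = set(a), set(b)
    union = len(a | b)
    return len(a & b) / union if union else 0.0


# Hypothetical boolean preferences: user -> threads they contributed to.
threads_by_user = {
    "alice": {"t1", "t2", "t3"},
    "bob": {"t2", "t3", "t4"},
    "carol": {"t9"},
}


def most_similar(user, prefs):
    # Score every other user against `user` and return the closest one.
    scored = [(tanimoto(prefs[user], prefs[o]), o) for o in prefs if o != user]
    return max(scored)[1]


nearest = most_similar("alice", threads_by_user)
```

A user-based recommender would then suggest threads the nearest users contributed to that the target user has not seen yet, for example thread t4 for alice here.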
Editor's Notes

  1. We encounter recommendations everywhere today, from books to music to people.
  2. Clustering combines related items into groups, like text documents organized by topic.
  3. Classification is assigning classes or categories to new data based on what we know about existing data.
  4. Association mining identifies items that frequently appear together, whether shopping cart contents or frequently co-occurring terms.
  5. It’s not the fastest linear algebra library, but it’s high performance and uses a reasonably small memory footprint. Based upon COLT from CERN. It’s not the fastest collections library, but it implements collections of primitive types that use open addressing: fundamental stuff that’s missing from java.util, and things that weren’t previously available under a commercially friendly license.
  6. It’s not the fastest linear algebra library, but it’s high performance and uses a reasonably small memory footprint. Based upon COLT from CERN. It’s not the fastest collections library, but it implements collections of primitive types that use open addressing: fundamental stuff that’s missing from java.util, and things that weren’t previously available under a commercially friendly license.
  7. Modify sequence file labels
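
Notes 5 and 6 mention collections that use open addressing. As a rough illustration of that idea, here is a fixed-capacity map that stores keys and values in two flat arrays and resolves collisions by linear probing. The class name, the fixed capacity, and the probing strategy are assumptions made for the sketch. Mahout's collections are Java classes over primitive types and are not reproduced here.

```python
class IntIntMap:
    """Fixed-capacity int -> int map using open addressing with linear probing."""

    _EMPTY = object()  # sentinel marking an unused slot

    def __init__(self, capacity=16):
        # Keys and values live in parallel flat arrays, no per-entry node objects.
        self._keys = [self._EMPTY] * capacity
        self._values = [0] * capacity
        self._size = 0

    def _slot(self, key):
        # Start at the hashed position and step forward until we find
        # either the key itself or an empty slot.
        i = hash(key) % len(self._keys)
        while self._keys[i] is not self._EMPTY and self._keys[i] != key:
            i = (i + 1) % len(self._keys)
        return i

    def put(self, key, value):
        i = self._slot(key)
        if self._keys[i] is self._EMPTY:
            # Always keep one slot free so probing is guaranteed to terminate.
            if self._size + 1 >= len(self._keys):
                raise OverflowError("map is full; this sketch does not resize")
            self._keys[i] = key
            self._size += 1
        self._values[i] = value

    def get(self, key, default=None):
        i = self._slot(key)
        return self._values[i] if self._keys[i] is not self._EMPTY else default


counts = IntIntMap(capacity=16)
counts.put(1, 10)
counts.put(17, 20)  # 17 % 16 == 1, so this collides with key 1 and probes onward
```

The memory savings the notes allude to come from this flat layout: in a language with primitive arrays, there is no boxed object or linked node per entry.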