SlideShare une entreprise Scribd logo
1  sur  30
Télécharger pour lire hors ligne
Classification on Mahout
Naoki Nakatani
San Jose State University
CS185C Spring 2014
Agenda
● Classification Overview
● Mahout Overview
○ Classification on Mahout
● Case Study with Demo
○ Problem Description
○ Working Environment
○ Data Preparation
○ ML Model Generation
Classification?
● Classifying examples into given set of categories
● Supervised learning
○ Prepare data
○ Build classifier (train & test)
○ Apply classifier to new data
http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
Mahout?
● Scalable machine learning
library
= Can handle Big Data
● Runs on HDFS
● Classification, Clustering,
Collaborative Filtering , etc
http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
Classification on Mahout?
Classifying examples into given set of categories
Scalable machine learning library that can handle big data
Classifying big data into given set of categories
Case Study & Demo
Given question with title and body, can we
automatically generate tags for it?
Where can I find the
LaTeX3 manual?
Few month ago I saw a big pdf-manual of all
LaTeX3-packages and the new syntax. I think
it was bigger than 300 pages. I can't find it on
the web.
Does anyone have a link?
Documentation
latex3
expl3
Dataset
File :
● TrainSmall.tsv
Fields :
● id, title, body, tags
Characteristics :
● Each question contains
only one tag
0
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
0
0
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
“----” , ”-----------” , “------------------------” , “--- --- --- ---
”
Working Environment
● Mac OS 10.9.1
● Eclipse 4.3.2
● Hadoop 1.2.1
● Mahout 0.9
● Source code available here.
Prerequisite (Where are you?)
● You have input tsv file at result > output-topfivetags.
● You are at “result” directory in Terminal.
● Command “hadoop” and “mahout” is working.
Prepare Data
1. Convert TSV file to Hadoop sequence file format.
Specify tag as a category. (Run TSVToSeq.java)
output-tsvtoseq folder
and chunk-0 file is
created.
Prepare Data
1. Make directory in HDFS and upload chunk-0 (sequence
file) to the folder.
hadoop fs -mkdir <directory>
hadoop fs -put <source> <destination>
Prepare Data
2. Transform questions into vectors. (mahout seq2sparse)
mahout seq2sparse -i <input directory> -o <output directory>
Prepare Data
3. Split data into
a. Train set : to train model
b. Test set : to test model
mahout split 
-i <input directory> 
--trainingOutput <output dir to train> 
--testOutput <output dir to test> 
--randomSelectionPct <integer> 
--overwrite 
--sequenceFiles 
-xm sequential
Build Classifier
1. Choose algorithm to use for classification
Available algorithms:
○ Naive Bayes
■ trainnb, testnb
■ org.apache.mahout.
classifier.naivebayes
○ Hidden Markov Model
■ baumwelch, hmmpredict
■ org.apache.mahout.
classifier.sequencelearning.
hmm
○ Logistic Regression
■ trainlogistic, testlogistic
■ org.apache.mahout.
classifier.sgd
○ Random Forest
■ ?
■ ?
2. Train & test model using train set
Should yield high accuracy
Build Classifier (Naive Bayes)
mahout trainnb 
-i <dir to train vectors> 
-el 
-li <dir to put label index> 
-o <dir to put model> 
-ow 
-c
mahout testnb 
-i <dir to train vectors> 
-m <dir to model> 
-l <dir to label index> 
-ow 
-o <output dir> 
-c
Build Classifier (Naive Bayes)
3. Test model using test set
Check if the accuracy is satisfactory
Apply Classifier
What do you have at this point?
● model
● label index
You can start classifying new data! (Check this example)
Model
Label Index
References
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter
messages
● Using the Mahout Naive Bayes Classifier to automatically classify Twitter
messages (part 2: distribute classification with hadoop)
Happy Machine Learning!

Contenu connexe

Tendances

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopGrant Ingersoll
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - RecommendationCataldo Musto
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDCDrew Farris
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture OverviewStefano Dalla Palma
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahoutsscdotopen
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahouttanuvir
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用James Chen
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsDataStax Academy
 
Apache Mahout
Apache MahoutApache Mahout
Apache MahoutAjit Koti
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Cataldo Musto
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache MahoutAman Adhikari
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenderssscdotopen
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahoutGaurav Kasliwal
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningRobin Anil
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...Sebastian Raschka
 

Tendances (20)

Intro to Mahout -- DC Hadoop
Intro to Mahout -- DC HadoopIntro to Mahout -- DC Hadoop
Intro to Mahout -- DC Hadoop
 
Tutorial Mahout - Recommendation
Tutorial Mahout - RecommendationTutorial Mahout - Recommendation
Tutorial Mahout - Recommendation
 
Mahout Introduction BarCampDC
Mahout Introduction BarCampDCMahout Introduction BarCampDC
Mahout Introduction BarCampDC
 
Apache Mahout Architecture Overview
Apache Mahout Architecture OverviewApache Mahout Architecture Overview
Apache Mahout Architecture Overview
 
Introduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache MahoutIntroduction to Collaborative Filtering with Apache Mahout
Introduction to Collaborative Filtering with Apache Mahout
 
Logistic Regression using Mahout
Logistic Regression using MahoutLogistic Regression using Mahout
Logistic Regression using Mahout
 
Intro to Apache Mahout
Intro to Apache MahoutIntro to Apache Mahout
Intro to Apache Mahout
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用Apache Mahout 於電子商務的應用
Apache Mahout 於電子商務的應用
 
Apache mahout
Apache mahoutApache mahout
Apache mahout
 
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data PlatformsCassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
Cassandra Summit 2014: Apache Spark - The SDK for All Big Data Platforms
 
Apache Mahout
Apache MahoutApache Mahout
Apache Mahout
 
Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014 Apache Mahout Tutorial - Recommendation - 2013/2014
Apache Mahout Tutorial - Recommendation - 2013/2014
 
Mahout
MahoutMahout
Mahout
 
Introduction to Apache Mahout
Introduction to Apache MahoutIntroduction to Apache Mahout
Introduction to Apache Mahout
 
mahout introduction
mahout  introductionmahout  introduction
mahout introduction
 
Next directions in Mahout's recommenders
Next directions in Mahout's recommendersNext directions in Mahout's recommenders
Next directions in Mahout's recommenders
 
Random forest using apache mahout
Random forest using apache mahoutRandom forest using apache mahout
Random forest using apache mahout
 
OSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine LearningOSCON: Apache Mahout - Mammoth Scale Machine Learning
OSCON: Apache Mahout - Mammoth Scale Machine Learning
 
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...An Introduction to Supervised Machine Learning and Pattern Classification: Th...
An Introduction to Supervised Machine Learning and Pattern Classification: Th...
 

En vedette

Machine intelligente d’analyse financiere
Machine intelligente d’analyse financiereMachine intelligente d’analyse financiere
Machine intelligente d’analyse financiereSabrine MASTOURA
 
Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Antidot
 
Machine learning pour tous
Machine learning pour tousMachine learning pour tous
Machine learning pour tousDamien Seguy
 
Apprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de rechercheApprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de recherchePhilippe YONNET
 
Mix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueMix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueDidier Girard
 
Machine learning
Machine learningMachine learning
Machine learningebiznext
 
Introduction au Machine Learning
Introduction au Machine LearningIntroduction au Machine Learning
Introduction au Machine LearningMathieu Goeminne
 
Analyse financière
Analyse financièreAnalyse financière
Analyse financièreAbdo attar
 
Ia project Apprentissage Automatique
Ia project Apprentissage AutomatiqueIa project Apprentissage Automatique
Ia project Apprentissage AutomatiqueNizar Bechir
 
Cours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkCours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkAmal Abid
 
TP2 Big Data HBase
TP2 Big Data HBaseTP2 Big Data HBase
TP2 Big Data HBaseAmal Abid
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1Amal Abid
 

En vedette (13)

Machine intelligente d’analyse financiere
Machine intelligente d’analyse financiereMachine intelligente d’analyse financiere
Machine intelligente d’analyse financiere
 
Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...Machine learning, deep learning et search : à quand ces innovations dans nos ...
Machine learning, deep learning et search : à quand ces innovations dans nos ...
 
Machine learning pour tous
Machine learning pour tousMachine learning pour tous
Machine learning pour tous
 
Apprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de rechercheApprentissage Automatique et moteurs de recherche
Apprentissage Automatique et moteurs de recherche
 
Mix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation NumériqueMix it2014 - Machine Learning et Régulation Numérique
Mix it2014 - Machine Learning et Régulation Numérique
 
Mahout clustering
Mahout clusteringMahout clustering
Mahout clustering
 
Machine learning
Machine learningMachine learning
Machine learning
 
Introduction au Machine Learning
Introduction au Machine LearningIntroduction au Machine Learning
Introduction au Machine Learning
 
Analyse financière
Analyse financièreAnalyse financière
Analyse financière
 
Ia project Apprentissage Automatique
Ia project Apprentissage AutomatiqueIa project Apprentissage Automatique
Ia project Apprentissage Automatique
 
Cours Big Data Chap4 - Spark
Cours Big Data Chap4 - SparkCours Big Data Chap4 - Spark
Cours Big Data Chap4 - Spark
 
TP2 Big Data HBase
TP2 Big Data HBaseTP2 Big Data HBase
TP2 Big Data HBase
 
Cours Big Data Chap1
Cours Big Data Chap1Cours Big Data Chap1
Cours Big Data Chap1
 

Similaire à Mahout classification presentation

Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsPaloma De Las Cuevas
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Holden Karau
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using MahoutIMC Institute
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep LearningGreg Gandenberger
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkSri Ambati
 
0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.docRohanS38
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive BayesJosh Patterson
 
Drupal 8: frontend development
Drupal 8: frontend developmentDrupal 8: frontend development
Drupal 8: frontend developmentsparkfabrik
 
Data scientist enablement dse 400 week 6 roadmap
Data scientist enablement   dse 400   week 6 roadmapData scientist enablement   dse 400   week 6 roadmap
Data scientist enablement dse 400 week 6 roadmapDr. Mohan K. Bavirisetty
 
Advanced WhizzML Workflows
Advanced WhizzML WorkflowsAdvanced WhizzML Workflows
Advanced WhizzML WorkflowsBigML, Inc
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cyclehktripathy
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...RINUSATHYAN
 
Analysis using r
Analysis using rAnalysis using r
Analysis using rPriya Mohan
 
Python: Modules and Packages
Python: Modules and PackagesPython: Modules and Packages
Python: Modules and PackagesDamian T. Gordon
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderDatabricks
 
Hands On Intro to Node.js
Hands On Intro to Node.jsHands On Intro to Node.js
Hands On Intro to Node.jsChris Cowan
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple stepsRenjith M P
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analyticsFarheen Nilofer
 

Similaire à Mahout classification presentation (20)

Applying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systemsApplying soft computing techniques to corporate mobile security systems
Applying soft computing techniques to corporate mobile security systems
 
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
Ml pipelines with Apache spark and Apache beam - Ottawa Reactive meetup Augus...
 
Big Data Analytics using Mahout
Big Data Analytics using MahoutBig Data Analytics using Mahout
Big Data Analytics using Mahout
 
Fast and Reproducible Deep Learning
Fast and Reproducible Deep LearningFast and Reproducible Deep Learning
Fast and Reproducible Deep Learning
 
StackNet Meta-Modelling framework
StackNet Meta-Modelling frameworkStackNet Meta-Modelling framework
StackNet Meta-Modelling framework
 
0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc0_ML lab programs updated123 (3).1687933796093.doc
0_ML lab programs updated123 (3).1687933796093.doc
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
 
Drupal 8: frontend development
Drupal 8: frontend developmentDrupal 8: frontend development
Drupal 8: frontend development
 
NYC_2016_slides
NYC_2016_slidesNYC_2016_slides
NYC_2016_slides
 
Data scientist enablement dse 400 week 6 roadmap
Data scientist enablement   dse 400   week 6 roadmapData scientist enablement   dse 400   week 6 roadmap
Data scientist enablement dse 400 week 6 roadmap
 
2.5 lab1
2.5 lab12.5 lab1
2.5 lab1
 
Advanced WhizzML Workflows
Advanced WhizzML WorkflowsAdvanced WhizzML Workflows
Advanced WhizzML Workflows
 
Lecture2 big data life cycle
Lecture2 big data life cycleLecture2 big data life cycle
Lecture2 big data life cycle
 
Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...Frameworks provide structure. The core objective of the Big Data Framework is...
Frameworks provide structure. The core objective of the Big Data Framework is...
 
Analysis using r
Analysis using rAnalysis using r
Analysis using r
 
Python: Modules and Packages
Python: Modules and PackagesPython: Modules and Packages
Python: Modules and Packages
 
Accelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks AutoloaderAccelerating Data Ingestion with Databricks Autoloader
Accelerating Data Ingestion with Databricks Autoloader
 
Hands On Intro to Node.js
Hands On Intro to Node.jsHands On Intro to Node.js
Hands On Intro to Node.js
 
Start machine learning in 5 simple steps
Start machine learning in 5 simple stepsStart machine learning in 5 simple steps
Start machine learning in 5 simple steps
 
Topic modeling using big data analytics
Topic modeling using big data analyticsTopic modeling using big data analytics
Topic modeling using big data analytics
 

Dernier

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Zilliz
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsNanddeep Nachan
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native ApplicationsWSO2
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...apidays
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024The Digital Insurer
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfOverkill Security
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 

Dernier (20)

Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
Emergent Methods: Multi-lingual narrative tracking in the news - real-time ex...
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
MS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectorsMS Copilot expands with MS Graph connectors
MS Copilot expands with MS Graph connectors
 
Architecting Cloud Native Applications
Architecting Cloud Native ApplicationsArchitecting Cloud Native Applications
Architecting Cloud Native Applications
 
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
Apidays New York 2024 - Accelerating FinTech Innovation by Vasa Krishnan, Fin...
 
AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024AXA XL - Insurer Innovation Award Americas 2024
AXA XL - Insurer Innovation Award Americas 2024
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Ransomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdfRansomware_Q4_2023. The report. [EN].pdf
Ransomware_Q4_2023. The report. [EN].pdf
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 

Mahout classification presentation

  • 1. Classification on Mahout Naoki Nakatani San Jose State University CS185C Spring 2014
  • 2. Agenda ● Classification Overview ● Mahout Overview ○ Classification on Mahout ● Case Study with Demo ○ Problem Description ○ Working Environment ○ Data Preparation ○ ML Model Generation
  • 3. Classification? ● Classifying examples into given set of categories ● Supervised learning ○ Prepare data ○ Build classifier (train & test) ○ Apply classifier to new data http://www.ndm.net/opentext/images/stories/images/extraction_cmyk_thumb.jpg
  • 4. Mahout? ● Scalable machine learning library = Can handle Big Data ● Runs on HDFS ● Classification, Clustering, Collaborative Filtering , etc http://www.robinanil.com/wp-content/uploads/2010/03/mahout-logo-200.png
  • 5. Classification on Mahout? Classifying examples into given set of categories Scalable machine learning library that can handle big data Classifying big data into given set of categories
  • 6. Case Study & Demo Given question with title and body, can we automatically generate tags for it? Where can I find the LaTeX3 manual? Few month ago I saw a big pdf-manual of all LaTeX3-packages and the new syntax. I think it was bigger than 300 pages. I can't find it on the web. Does anyone have a link? Documentation latex3 expl3
  • 7. Dataset File : ● TrainSmall.tsv Fields : ● id, title, body, tags Characteristics : ● Each question contains only one tag 0 “----” , ”-----------” , “------------------------” , “--- --- --- --- ” 0 0 “----” , ”-----------” , “------------------------” , “--- --- --- --- ” “----” , ”-----------” , “------------------------” , “--- --- --- --- ”
  • 8. Working Environment ● Mac OS 10.9.1 ● Eclipse 4.3.2 ● Hadoop 1.2.1 ● Mahout 0.9 ● Source code available here.
  • 9. Prerequisite (Where are you?) ● You have input tsv file at result > output-topfivetags. ● You are at “result” directory in Terminal. ● Command “hadoop” and “mahout” is working.
  • 10. Prepare Data 1. Convert TSV file to Hadoop sequence file format. Specify tag as a category. (Run TSVToSeq.java) output-tsvtoseq folder and chunk-0 file is created.
  • 11. Prepare Data 1. Make directory in HDFS and upload chunk-0 (sequence file) to the folder.
  • 12. hadoop fs -mkdir <directory>
  • 13. hadoop fs -put <source> <destination>
  • 14. Prepare Data 2. Transform questions into vectors. (mahout seq2sparse)
  • 15. mahout seq2sparse -i <input directory> -o <output directory>
  • 16.
  • 17. Prepare Data 3. Split data into a. Train set : to train model b. Test set : to test model
  • 18. mahout split -i <input directory> --trainingOutput <output dir to train> --testOutput <output dir to test> --randomSelectionPct <integer> --overwrite --sequenceFiles -xm sequential
  • 19.
  • 20. Build Classifier 1. Choose algorithm to use for classification Available algorithms: ○ Naive Bayes ■ trainnb, testnb ■ org.apache.mahout. classifier.naivebayes ○ Hidden Markov Model ■ baumwelch, hmmpredict ■ org.apache.mahout. classifier.sequencelearning. hmm ○ Logistic Regression ■ trainlogistic, testlogistic ■ org.apache.mahout. classifier.sgd ○ Random Forest ■ ? ■ ?
  • 21. 2. Train & test model using train set Should yield high accuracy Build Classifier (Naive Bayes)
  • 22. mahout trainnb -i <dir to train vectors> -el -li <dir to put label index> -o <dir to put model> -ow -c
  • 23.
  • 24. mahout testnb -i <dir to train vectors> -m <dir to model> -l <dir to label index> -ow -o <output dir> -c
  • 25.
  • 26. Build Classifier (Naive Bayes) 3. Test model using test set Check if the accuracy is satisfactory
  • 27.
  • 28. Apply Classifier What do you have at this point? ● model ● label index You can start classifying new data! (Check this example) Model Label Index
  • 29. References ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages ● Using the Mahout Naive Bayes Classifier to automatically classify Twitter messages (part 2: distribute classification with hadoop)