Bayesian Classifiers Even a Beginner Can Understand: From the Basics to a Mahout Implementation
- 1. Mahout
2010/09/26 #TokyoWebmining 7
naoki yanai
@yanaoki
- 2. LL Ruby
Mahout Java/Hadoop
- 3. yanaoki
Web Ruby Java
#TokyoWebmining
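- 5. P(B) = probability that event B occurs
P(B|A) = probability that B occurs given that event A has occurred (conditional probability)
P(B|A) = P(A and B) / P(A), defined only when P(A) > 0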
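To make the definition of P(B|A) concrete, here is a minimal sketch that computes it by counting outcomes. The die example, the `probability` helper and the two events are illustrative assumptions, not part of the original slides.

```ruby
# Sample space: one roll of a fair six-sided die (an illustrative assumption).
outcomes = (1..6).to_a

# P(event) = favourable outcomes / all outcomes, kept exact with Rational.
def probability(outcomes, &event)
  Rational(outcomes.count(&event), outcomes.size)
end

a = ->(x) { x.even? } # event A: the roll is even
b = ->(x) { x > 3 }   # event B: the roll is greater than 3

p_a       = probability(outcomes, &a)
p_b       = probability(outcomes, &b)
p_a_and_b = probability(outcomes) { |x| a.call(x) && b.call(x) }

# The conditional probability is only defined when P(A) > 0.
raise ArgumentError, 'P(A) must be positive' unless p_a > 0
p_b_given_a = p_a_and_b / p_a

puts "P(B) = #{p_b}, P(B|A) = #{p_b_given_a}"
```

Knowing that A happened changes the probability of B here, which is exactly the kind of update a Bayesian classifier relies on.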
- 7. 1.
(a)
2.
(a)
3.
(a)
(b) 1. →2. →3.
(c)
4.
(a)
- 12. LL
# Create the classifier
c = NaiveBayes.new
# Train it with documents labeled 'good' or 'bad'
c.train('Nobody owns the water.', 'good')
c.train('the quick rabbit jumps fences', 'good')
c.train('buy pharmaceuticals now', 'bad')
c.train('make quick money at the online casino', 'bad')
c.train('the quick brown fox jumps', 'good')
# Classify new text; the second argument is returned when no category wins clearly
c.classify('quick rabbit', 'unknown') #=> "good"
c.classify('quick money', 'unknown') #=> "bad"
# Raise the threshold: 'bad' must now be 3 times more probable than any other category
c.setthreshold('bad', 3.0)
c.classify('quick money', 'unknown') #=> "unknown"
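The slide shows only how the classifier is called, not the `NaiveBayes` class itself. The following is a minimal sketch of one way such a class could be written, assuming word-presence features, a weighted probability with an assumed prior of 0.5, and a per-category threshold rule. The method names `train`, `classify` and `setthreshold` come from the slide. Everything else (private helper names, the 3 to 19 character word filter, the default values) is an assumption.

```ruby
# Hypothetical minimal implementation; not the speaker's actual code.
class NaiveBayes
  def initialize
    @feature_counts  = Hash.new { |h, k| h[k] = Hash.new(0) } # feature => {category => count}
    @category_counts = Hash.new(0)                            # category => number of documents
    @thresholds      = Hash.new(1.0)                          # category => required ratio
  end

  # Count each distinct word of the document once for the given category.
  def train(text, category)
    features(text).each { |f| @feature_counts[f][category] += 1 }
    @category_counts[category] += 1
  end

  def setthreshold(category, value)
    @thresholds[category] = value
  end

  # Return the most probable category, or `default` when the best category
  # is not at least threshold-times more probable than every other one.
  def classify(text, default = nil)
    probs = @category_counts.keys.map { |c| [c, prob(text, c)] }.to_h
    best, best_p = probs.max_by { |_, p| p }
    return default if best.nil?
    probs.each do |category, p|
      next if category == best
      return default if p * @thresholds[best] > best_p
    end
    best
  end

  private

  # Lowercased distinct words of 3 to 19 characters (assumed filter).
  def features(text)
    text.downcase.split(/\W+/).select { |w| w.length > 2 && w.length < 20 }.uniq
  end

  # Blend the observed P(feature | category) with an assumed prior so that
  # rarely seen words do not dominate the result.
  def weighted_prob(feature, category, weight = 1.0, assumed = 0.5)
    basic = @feature_counts[feature][category].to_f / @category_counts[category]
    total = @feature_counts[feature].values.sum
    (weight * assumed + total * basic) / (weight + total)
  end

  # P(category) * product of P(feature | category): the naive independence assumption.
  def prob(text, category)
    prior = @category_counts[category].to_f / @category_counts.values.sum
    features(text).reduce(prior) { |acc, f| acc * weighted_prob(f, category) }
  end
end
```

With the five training sentences from the slide, this sketch gives the same three answers as the slide: "good", "bad", and "unknown" once the threshold for 'bad' is set to 3.0.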
- 13. Mahout
Java/Hadoop
Classifier
CF
Clustering
Pattern Mining
- 14. Mahout
Mahout in Action
MEAP PDF
- 17. Mahout
Naive Bayes
Complementary Naive Bayes
Tackling the Poor Assumptions of Naive Bayes Text Classifiers
http://people.csail.mit.edu/jrennie/papers/icml03-nb.pdf
Hadoop
Hadoop
Elastic MapReduce
Wikipedia 20NewsGroups Example
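As a rough illustration of the Complementary Naive Bayes idea named above, the sketch below estimates each category's word weights from the documents of all other categories, then picks the category whose complement fits the document worst. It is a simplified reading of the cited paper and is not Mahout's implementation: it leaves out the paper's TF-IDF transform, length normalization and weight normalization. The function names and the toy data are assumptions.

```ruby
# docs_by_category: { category => [ [token, ...], ... ] }
# Returns { category => { token => log-weight estimated from the OTHER categories } }.
def complement_weights(docs_by_category, alpha = 1.0)
  vocabulary = docs_by_category.values.flatten.uniq
  docs_by_category.keys.map do |category|
    # Count every token occurring in documents that do NOT belong to this category.
    counts = Hash.new(0)
    docs_by_category.each do |other, docs|
      next if other == category
      docs.flatten.each { |token| counts[token] += 1 }
    end
    # Additive smoothing with alpha keeps unseen tokens away from log(0).
    total = (counts.values.sum + alpha * vocabulary.size).to_f
    weights = vocabulary.map { |t| [t, Math.log((counts[t] + alpha) / total)] }.to_h
    [category, weights]
  end.to_h
end

# Choose the category whose complement explains the tokens worst.
# Tokens outside the training vocabulary are ignored (a simplification).
def classify_complement(weights_by_category, tokens)
  weights_by_category.min_by do |_, weights|
    tokens.sum { |t| weights.fetch(t, 0.0) }
  end.first
end

# Toy data, invented for illustration only.
docs = {
  'game'   => [%w[play level boss], %w[boss level score]],
  'sports' => [%w[team score goal], %w[goal match team]]
}
weights = complement_weights(docs)
puts classify_complement(weights, %w[boss level])
puts classify_complement(weights, %w[team goal])
```

Estimating from the complement uses more data per category, which the paper argues helps when the training set is skewed toward some categories.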
- 19. Mahout
1.
Get the English Wikipedia dump (xml.bz2)
Split the XML dump into chunk files (*.xml)
64 MB per chunk: 396 chunks, about 25 GB in total
$ java -Xmx2048m org.apache.mahout.classifier.bayes.WikipediaXmlSplitter \
  -d $WORK_DIR/enwiki-20100130-pages-articles.xml.bz2 \
  -o $WORK_DIR/chunks/ \
  -c 64
- 20. Mahout
1.
Upload the chunks to S3 (used as MapReduce input)
Categories: Sports, Game
Hadoop cluster: m1.small instances, 1 master / 2 slaves
$ elastic-mapreduce \
  --create \
  --name "wikipedia classifier" \
  --alive \
  --log-uri s3://yanaokimrsample/classifier/wikipedia/logs \
  --num-instances 3 \
  --instance-type m1.small \
  --availability-zone us-west-1a
- 21. Mahout
1.
$ elastic-mapreduce -j j-24EIFIRXBO39M \
  --jar s3n://yanaokimrsample/jars/mahout-examples-0.3.job \
  --main-class org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
  --arg --input --arg s3n://yanaokimrsample/classifier/wikipedia/chunks \
  --arg --output --arg s3n://yanaokimrsample/classifier/wikipedia/input \
  --arg --categories --arg /home/hadoop/cat.txt \
  --step-name "wikipedia train dataset creator driver"
Each Wikipedia article in a target category becomes one line of training data:
<category><\t><word><space><word><space>...
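The one-line-per-document layout that this step appears to produce (a category, a tab, then space-separated words) can be sketched as a pair of small helpers. The helper names and the sample tokens are hypothetical. The exact tokenization Mahout applies is not shown on the slide.

```ruby
# Build one training line: category, a tab, then the words separated by spaces.
def to_training_line(category, tokens)
  "#{category}\t#{tokens.join(' ')}"
end

# Split a training line back into its category and its list of words.
def parse_training_line(line)
  category, text = line.chomp.split("\t", 2)
  [category, text.to_s.split(' ')]
end

line = to_training_line('sports', %w[football league season])
category, tokens = parse_training_line(line)
puts line.inspect
puts "#{category}: #{tokens.size} tokens"
```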
- 22. Mahout
1.
$ elastic-mapreduce -j j-24EIFIRXBO39M \
  --jar s3n://yanaokimrsample/jars/mahout-examples-0.3.job \
  --main-class org.apache.mahout.classifier.bayes.WikipediaDatasetCreatorDriver \
  --arg --input --arg s3n://yanaokimrsample/classifier/wikipedia/chunks2 \
  --arg --output --arg s3n://yanaokimrsample/classifier/wikipedia/input2 \
  --arg --categories --arg /home/hadoop/cat.txt \
  --step-name "wikipedia test dataset creator driver"
- 23. Mahout
2.
Parameters: N-gram size (--gramSize) and alpha (--alpha)
Classifier type: Bayes or CBayes; training runs as MapReduce
$ elastic-mapreduce -j j-24EIFIRXBO39M \
  --jar s3n://yanaokimrsample/jars/mahout-core-0.3.job \
  --main-class org.apache.mahout.classifier.bayes.TrainClassifier \
  --arg --input --arg s3n://yanaokimrsample/classifier/wikipedia/input \
  --arg --output --arg s3n://yanaokimrsample/classifier/wikipedia/model \
  --arg --gramSize --arg 2 \
  --arg --classifierType --arg bayes \
  --arg --alpha --arg 0.5 \
  --arg --dataSource --arg hdfs \
  --step-name "train classifier"
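To give a feel for the two training parameters, here is a small sketch of what a word N-gram is and of how an additive smoothing term such as alpha affects a probability estimate. It is a generic illustration under the assumption that alpha acts as an additive smoothing constant. Mahout's own tokenization and formula may differ, and the function names are hypothetical.

```ruby
# Word N-grams: every run of n consecutive tokens, joined with a space.
def ngrams(tokens, n)
  tokens.each_cons(n).map { |gram| gram.join(' ') }
end

# Additive smoothing: alpha is added to every count so that features
# never seen with a category still get a small non-zero probability.
def smoothed_prob(count, total, vocabulary_size, alpha)
  (count + alpha) / (total + alpha * vocabulary_size).to_f
end

tokens  = %w[the quick brown fox]
bigrams = ngrams(tokens, 2)
puts bigrams.inspect

# An unseen feature (count 0) among 10 observations and 5 distinct features.
puts smoothed_prob(0, 10, 5, 0.5)
```

A larger gram size captures short phrases at the cost of a much larger feature space, and a larger alpha pulls all estimates closer to uniform.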
- 24. Mahout
2.
Parameters: the same N-gram size and alpha as used for training
$ elastic-mapreduce -j j-24EIFIRXBO39M \
  --jar s3n://yanaokimrsample/jars/mahout-core-0.3.job \
  --main-class org.apache.mahout.classifier.bayes.TestClassifier \
  --arg --model --arg s3n://yanaokimrsample/classifier/wikipedia/model \
  --arg --testDir --arg s3n://yanaokimrsample/classifier/wikipedia/input2 \
  --arg --gramSize --arg 2 \
  --arg --classifierType --arg bayes \
  --arg --alpha --arg 0.5 \
  --arg --method --arg mapreduce \
  --arg --dataSource --arg hdfs \
  --arg --encoding --arg UTF-8 \
  --step-name "test classifier"
- 25. Mahout
2.
=======================================================
Confusion Matrix
-------------------------------------------------------
a     b     <--Classified as
464   2     | 466   a = game
159   169   | 328   b = sports
Default Category: unknown: 2
→sports
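The confusion matrix can be turned into summary numbers with a few lines of code. The sketch below copies the four cell counts from the slide and derives the overall accuracy and the per-category recall. The "Default Category: unknown" line is ignored here, and the variable names are assumptions.

```ruby
# Rows are the true category, columns the predicted category,
# copied from the confusion matrix on the slide.
matrix = {
  'game'   => { 'game' => 464, 'sports' => 2 },
  'sports' => { 'game' => 159, 'sports' => 169 }
}

total    = matrix.values.sum { |row| row.values.sum }
correct  = matrix.sum { |label, row| row[label] }
accuracy = correct.to_f / total

# Recall: the share of each true category that was classified correctly.
recall = matrix.map { |label, row| [label, row[label].to_f / row.values.sum] }.to_h

puts format('accuracy: %.4f', accuracy)
recall.each { |label, r| puts format('recall(%s): %.4f', label, r) }
```

On these counts the accuracy is about 0.80, with recall near 1.00 for game but only about 0.52 for sports, because 159 of the 328 sports documents were classified as game.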
- 26. Mahout
2.
Classifier Mahout
HBase
orz
- 27. Hadoop
Mahout
N-gram alpha
Mahout in Action