New Directions for Mahout
- 4. 4©MapR Technologies - Confidential
Bloat is Leaving in 0.7
Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
If you care, get on the mailing list
0.7 is about to be released
- 6.
Nobody Cares about Collections
We need it, math is built on it
Pull it into math
Broke the build (battle of the code expanders)
Fixed now (thanks Grant)
- 8.
What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you don’t have knobs to turn or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results
Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
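A minimal brute-force sketch of the idea above, in Python rather than Mahout's Java. The function name `knn_predict` and the toy data are assumptions made for illustration, not code from the hackathon. It shows why the approach is conceptually simple and also why it is slow: every query is compared against every training example, so 3K queries x 200K examples already means 600 million distance computations.

```python
import math

def knn_predict(query, examples, k):
    """Average the target value of the k training examples nearest to query."""
    # examples is a list of (feature_vector, target) pairs
    by_distance = sorted(examples, key=lambda ex: math.dist(query, ex[0]))
    nearest = by_distance[:k]
    return sum(target for _, target in nearest) / len(nearest)

# Toy data (hypothetical): three examples near the origin and one far away
training = [
    ((0.0, 0.0), 1.0),
    ((1.0, 0.0), 3.0),
    ((0.0, 1.0), 5.0),
    ((10.0, 10.0), 100.0),
]
prediction = knn_predict((0.1, 0.1), training, k=3)
```

The far-away example does not influence the prediction because only the three nearest targets are averaged.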
- 10.
How We Did It
2-week hackathon with 6 developers from customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on GitHub
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
- 11.
What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Searcher interface
– ProjectionSearch, KmeansSearch, LshSearch, Brute
Super-fast clustering
– Kmeans, StreamingKmeans
- 14.
K-means Search
Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
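A rough sketch of the pre-cluster-then-search idea, in Python. This is not the KmeansSearch class itself; the names `build_index`, `search`, and `probes` are assumptions, and the centroids are supplied by hand rather than computed. Each cluster is scanned by brute force here, where the slide suggests plugging in another Searcher.

```python
import math

def nearest_centroid(point, centroids):
    """Index of the centroid closest to point."""
    return min(range(len(centroids)), key=lambda i: math.dist(point, centroids[i]))

def build_index(points, centroids):
    """Pre-cluster: assign every point to its nearest centroid."""
    clusters = [[] for _ in centroids]
    for p in points:
        clusters[nearest_centroid(p, centroids)].append(p)
    return clusters

def search(query, centroids, clusters, k, probes):
    """Approximate k nearest neighbours: scan only the `probes` nearest clusters."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [p for i in order[:probes] for p in clusters[i]]
    # Brute-force scan of the candidates; a nested searcher could replace this step
    return sorted(candidates, key=lambda p: math.dist(query, p))[:k]

# Toy data (hypothetical): two well-separated groups and hand-picked centroids
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids = [(0.5, 0.5), (10.5, 10.5)]
clusters = build_index(points, centroids)
result = search((0.2, 0.1), centroids, clusters, k=2, probes=1)
```

With `probes=1` only half of the points are examined. The result is approximate in general, since a true neighbour can sit in a cluster that was not probed.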
- 20.
But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads)
– Very parallelizable
- 21.
How It Works
For each point
– Find approximately nearest centroid (distance = d)
– If d > threshold, new centroid
– Else possibly new cluster
– Else add to nearest centroid
If number of centroids > K ≈ C log N
– Recursively cluster centroids with higher threshold
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
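The steps above can be sketched in Python as follows. This is an illustration under stated assumptions, not the Mahout StreamingKmeans code: "possibly new cluster" is read as "start a new centroid with probability d / threshold", the threshold is raised by a fixed factor `growth` whenever there are too many centroids, and the nearest centroid is found by exact brute force rather than approximately. All names are hypothetical.

```python
import math
import random

def add_point(centroids, point, weight, threshold, rng):
    """Fold one weighted point into the list of [mean, weight] centroids."""
    if not centroids:
        centroids.append([list(point), weight])
        return
    nearest = min(centroids, key=lambda c: math.dist(point, c[0]))
    d = math.dist(point, nearest[0])
    # d > threshold always starts a new centroid;
    # below the threshold it happens with probability d / threshold (assumption)
    if rng.random() < d / threshold:
        centroids.append([list(point), weight])
    else:
        mean, w = nearest
        total = w + weight
        nearest[0] = [(m * w + x * weight) / total for m, x in zip(mean, point)]
        nearest[1] = total

def streaming_kmeans(points, max_centroids, threshold, growth=1.5, seed=0):
    """One pass over points; returns a list of [mean, weight] centroids."""
    rng = random.Random(seed)
    centroids = []
    for p in points:
        add_point(centroids, p, 1.0, threshold, rng)
        while len(centroids) > max_centroids:
            # Too many centroids: raise the threshold and cluster the centroids themselves
            threshold *= growth
            old, centroids = centroids, []
            rng.shuffle(old)
            for mean, w in old:
                add_point(centroids, mean, w, threshold, rng)
    return centroids

# Synthetic data (hypothetical): three Gaussian blobs of 200 points each
data_rng = random.Random(42)
points = [
    (data_rng.gauss(cx, 0.1), data_rng.gauss(cy, 0.1))
    for cx, cy in [(0, 0), (5, 5), (10, 0)]
    for _ in range(200)
]
sketch = streaming_kmeans(points, max_centroids=20, threshold=0.05)
```

The returned centroids carry weights, so the total weight always equals the number of points seen. That is what lets the weighted centroids stand in for the original data in a later clustering step.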
- 22.
Parallel Speedup?
[Figure: time per point (μs) against number of threads, comparing the threaded version, the non-threaded version, and perfect scaling.]
- 24.
Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
- 27.
What is it?
Supports Pig access to Mahout functions
So far text vectorization
And classification
And model saving
Kind of works (see pigML from Twitter for better functionality)
- 28.
Compile and Install
Start by compiling and installing mahout in your local repository:
cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests
Then do the same with pig-vector
cd ~/Apache
git clone git@github.com:tdunning/pig-vector.git
cd pig-vector
mvn package
- 29.
Tokenize and Vectorize Text
Tokenization is done using a text encoder, which is defined with three arguments:
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types
Example:
define EncodeVector
org.apache.mahout.pig.encoders.EncodeVector
('10','x+y+1', 'x:numeric, y:word, z:text');
You can also add a Lucene 3.1 analyzer in parentheses if you want something fancier
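To make the role of the vector dimension and the schema concrete, here is a small Python sketch of the generic hashing trick. It assumes the encoder hashes features into a fixed-length vector; it is not the pig-vector implementation, it ignores the formula argument, and the function name `encode`, the use of CRC32, and the sample record are all hypothetical.

```python
import zlib

def encode(record, schema, dimension):
    """Hash the fields of one record into a vector of fixed length."""
    vector = [0.0] * dimension
    for name, kind in schema.items():
        value = record[name]
        if kind == "numeric":
            # One slot per numeric variable, weighted by its value
            slot = zlib.crc32(name.encode()) % dimension
            vector[slot] += float(value)
        elif kind == "word":
            # A categorical value is hashed as a single token
            slot = zlib.crc32(f"{name}:{value}".encode()) % dimension
            vector[slot] += 1.0
        elif kind == "text":
            # Free text is split on whitespace and each token is hashed
            for token in str(value).lower().split():
                slot = zlib.crc32(f"{name}:{token}".encode()) % dimension
                vector[slot] += 1.0
    return vector

schema = {"x": "numeric", "y": "word", "z": "text"}
record = {"x": 2.0, "y": "red", "z": "fast red bike"}
v = encode(record, schema, dimension=10)
```

With a dimension as small as 10, several features will often share a slot. That is why the slide suggests dimensions in the hundreds of thousands for real text.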
- 31.
The Formula
Not normal arithmetic
Describes which variables to use, whether offset is included
Also describes which interactions to use
– but that doesn’t do anything yet!
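One plausible reading of a formula such as 'x+y+1', by analogy with R-style model formulas, is sketched below in Python. This interpretation is an assumption and is not taken from the pig-vector source: terms are separated by `+`, a bare `1` requests the offset, and interactions are not handled.

```python
def parse_formula(formula):
    """Split a formula into variable names and an offset flag (assumed syntax)."""
    terms = [t.strip() for t in formula.split("+")]
    return {
        "variables": [t for t in terms if t != "1"],  # named variables to encode
        "offset": "1" in terms,                       # a bare 1 asks for an offset term
    }

parsed = parse_formula("x+y+1")
```

Under this reading, the `z` column declared in the schema of the earlier example would be passed to the encoder but left out of the vector.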
- 32.
Load and Encode Data
Load the data
a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
as (x1:int, x2:int, x3:int);
And encode it
b = foreach a generate 1 as key, EncodeVector(*) as v;
Note that the true meaning of * is very subtle
Now store it
store b into 'vectors.dat' using com.twitter.elephantbird.pig.store.SequenceFileStorage
(
'-c com.twitter.elephantbird.pig.util.IntWritableConverter', '-c
com.twitter.elephantbird.pig.util.GenericWritableConverter
-t org.apache.mahout.math.VectorWritable');
- 33.
Train a Model
Pass previously encoded data to a sequential model trainer
define train org.apache.mahout.pig.LogisticRegression(
'iterations=5, inMemory=true, features=100000, categories=alt.atheism
comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
soc.religion.christian talk.religion.misc');
Note that the argument is a string with its own syntax
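Judging only from the example above, the option string looks like comma-separated key=value pairs, with the categories given as a space-separated list. The Python sketch below decomposes such a string under that assumption; the function name `parse_options` is hypothetical and the real parser in pig-vector may behave differently.

```python
def parse_options(spec):
    """Split 'key=value, key=value' pairs; categories become a list (assumed syntax)."""
    options = {}
    for item in spec.split(","):
        key, _, value = item.strip().partition("=")
        options[key] = value
    # Category names are separated by whitespace, including line breaks
    options["categories"] = options.get("categories", "").split()
    return options

spec = "iterations=5, inMemory=true, features=100000, categories=alt.atheism comp.graphics sci.med"
opts = parse_options(spec)
```

All values stay strings here; converting `iterations` or `features` to integers would be a further step.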
- 34.
Reservations and Qualms
Pig-vector isn’t done
And it is ugly
And it doesn’t quite work
And it is hard to build
But there seems to be promise
- 35.
Potential
Add Naïve Bayes Model?
Somehow simplify the syntax?
Try a recent version of elephant-bird?
Switch to pigML?
- 36.
Contact:
– tdunning@maprtech.com
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-bbuzz-2012
Hash tags: #bbuzz #mahout