Boston hug-2012-07
- 1. Mahout, New and Improved
Now with Super Fast Clustering
©MapR Technologies - Confidential 1
- 2. Agenda
What happened in Mahout 0.7
– less bloat
– simpler structure
– general cleanup
- 3. To Cut Out Bloat
- 6. Bloat is Leaving in 0.7
Lots of abandoned code in Mahout
– average code quality is poor
– no users
– no maintainers
– why do we care?
Examples
– old LDA
– old Naïve Bayes
– genetic algorithms
If you care, get on the mailing list
– oops, too late since 0.7 is already released
- 8. Nobody Cares about Collections
We need it; the math library is built on it
So it was pulled into mahout-math
Broke the build (battle of the code expanders)
Fixed now (thanks to Grant)
- 11. What is it?
Supports Pig access to Mahout functions
So far text vectorization
And classification
And model saving
Kind of works (see pigML from Twitter for more complete functionality)
- 12. Compile and Install
Start by compiling and installing Mahout in your local Maven repository:
cd ~/Apache
git clone https://github.com/apache/mahout.git
cd mahout
mvn install -DskipTests
Then do the same with pig-vector
cd ~/Apache
git clone git@github.com:tdunning/pig-vector.git
cd pig-vector
mvn package
- 13. Tokenize and Vectorize Text
Tokenization is done using a text encoder, which needs:
– the dimension of the resulting vectors (typically 100,000-1,000,000)
– a description of the variables to be included in the encoding
– the schema of the tuples that Pig will pass, together with their data types
Example:
define EncodeVector
org.apache.mahout.pig.encoders.EncodeVector
('10','x+y+1', 'x:numeric, y:word, z:text');
You can also add a Lucene 3.1 analyzer in parentheses if you want
something fancier
- 15. The Formula
Not normal arithmetic
Describes which variables to use, whether offset is included
Also describes which interactions to use
– but that doesn’t do anything yet!
- 16. Load and Encode Data
Load the data
a = load '/Users/tdunning/Downloads/NNBench.csv' using PigStorage(',')
as (x1:int, x2:int, x3:int);
And encode it
b = foreach a generate 1 as key, EncodeVector(*) as v;
Note that the true meaning of * is very subtle
Now store it
store b into 'vectors.dat' using
com.twitter.elephantbird.pig.store.SequenceFileStorage (
'-c com.twitter.elephantbird.pig.util.IntWritableConverter',
'-c com.twitter.elephantbird.pig.util.GenericWritableConverter
-t org.apache.mahout.math.VectorWritable');
- 17. Train a Model
Pass previously encoded data to a sequential model trainer
define train org.apache.mahout.pig.LogisticRegression(
'iterations=5, inMemory=true, features=100000, categories=alt.atheism
comp.sys.mac.hardware rec.motorcycles sci.electronics talk.politics.guns
comp.graphics comp.windows.x rec.sport.baseball sci.med talk.politics.mideast
comp.os.ms-windows.misc misc.forsale rec.sport.hockey sci.space
talk.politics.misc comp.sys.ibm.pc.hardware rec.autos sci.crypt
soc.religion.christian talk.religion.misc');
Note that the argument is a string with its own syntax
- 18. Reservations and Qualms
Pig-vector isn’t done
And it is ugly
And it doesn’t quite work
And it is hard to build
But there seems to be promise
- 19. Potential
Add Naïve Bayes Model?
Somehow simplify the syntax?
Try a recent version of elephant-bird?
Switch to pigML?
- 21. Goals
Cluster very large data sets
Facilitate large nearest neighbor search
Allow very large number of clusters
Achieve good quality
– low average distance to nearest centroid on held-out data
Based on Mahout Math
Runs on Hadoop (really MapR) cluster
FAST – cluster tens of millions in minutes
- 22. Non-goals
Use map-reduce (but it is there)
Minimize the number of clusters
Support metrics other than L2
- 23. Anti-goals
Multiple passes over original data
Scale as O(k n)
- 26. What’s that?
Find the k nearest training examples
Use the average value of the target variable from them
This is easy … but hard
– easy because it is so conceptually simple and you have few knobs to turn
or models to build
– hard because of the stunning amount of math
– also hard because we need top 50,000 results, not just single nearest
Initial prototype was massively too slow
– 3K queries x 200K examples takes hours
– needed 20M x 25M in the same time
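The brute-force baseline is easy to sketch (hypothetical Python for illustration, not the actual prototype code): score every example against every query and keep the top k with a heap. The quadratic cost structure is exactly what made 3K x 200K take hours.

```python
import heapq
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def knn_brute_force(queries, examples, k):
    # Every query is scored against every example: O(|Q| * |E|)
    # distance computations, which is why this cannot reach 20M x 25M.
    results = []
    for q in queries:
        # nsmallest keeps only the current k best while scanning
        results.append(heapq.nsmallest(k, examples,
                                       key=lambda e: sq_dist(q, e)))
    return results

random.seed(42)
examples = [[random.random() for _ in range(3)] for _ in range(1000)]
top = knn_brute_force([[0.5, 0.5, 0.5]], examples, k=50)
```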
- 32. How We Did It
Two-week hackathon with six developers from a customer bank
Agile-ish development
To avoid IP issues
– all code is Apache Licensed (no ownership question)
– all data is synthetic (no question of private data)
– all development done on individual machines, hosting on Github
– open is easier than closed (in this case)
Goal is new open technology to facilitate new closed solutions
Ambitious goal of ~ 1,000,000 x speedup
– well, really only 100-1000x after basic hygiene
- 33. What We Did
Mechanism for extending Mahout Vectors
– DelegatingVector, WeightedVector, Centroid
Shared memory matrix
– FileBasedMatrix uses mmap to share very large dense matrices
Searcher interface
– Brute, ProjectionSearch, KmeansSearch, LshSearch
Super-fast clustering
– Kmeans, StreamingKmeans
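The shared-memory trick behind FileBasedMatrix can be illustrated with stdlib Python rather than Mahout's Java (a sketch of the idea, not the actual API): write a dense matrix of doubles to a file, then mmap it, so any number of processes sharing the file share one in-memory copy of the matrix.

```python
import mmap
import struct
import tempfile

ROWS, COLS = 1000, 10
DOUBLE = struct.Struct('<d')  # one 8-byte little-endian double per entry

# Write a dense ROWS x COLS matrix of doubles to a plain file.
tmp = tempfile.NamedTemporaryFile(delete=False)
for i in range(ROWS):
    for j in range(COLS):
        tmp.write(DOUBLE.pack(float(i * COLS + j)))
tmp.close()

# Map the file into memory. Every process that maps the same file
# shares a single copy of the (possibly huge) dense matrix; the OS
# pages it in on demand instead of loading it all up front.
f = open(tmp.name, 'rb')
buf = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

def get(i, j):
    # row-major layout: entry (i, j) lives at a fixed byte offset
    offset = (i * COLS + j) * DOUBLE.size
    return DOUBLE.unpack_from(buf, offset)[0]

print(get(3, 4))  # prints 34.0
```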
- 35. Projection Search
Projection onto a line provides a total order on data
Nearby points stay nearby
Some other points also wind up close
Search points just before or just after the query point
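The idea on this slide can be sketched in a few lines of Python (an illustration, not Mahout's ProjectionSearch class): project every point onto one random direction, sort by the projection, and at query time examine only the points whose projections land just before and just after the query's.

```python
import bisect
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def build_index(points, dim):
    # Projection onto one random line gives a total order on the data;
    # nearby points stay nearby (but some far points also sneak in).
    direction = [random.gauss(0, 1) for _ in range(dim)]
    index = sorted((dot(p, direction), p) for p in points)
    return direction, index

def search(query, direction, index, window=50):
    # Look only at points whose projection falls just before or just
    # after the query's, then rank those candidates by true distance.
    q = dot(query, direction)
    i = bisect.bisect_left(index, (q,))
    candidates = [p for _, p in index[max(0, i - window):i + window]]
    return sorted(candidates, key=lambda p: sq_dist(p, query))

random.seed(1)
points = [[random.random() for _ in range(5)] for _ in range(2000)]
direction, index = build_index(points, dim=5)
best = search([0.5] * 5, direction, index)
```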
- 37. K-means Search
Simple Idea
– pre-cluster the data
– to find the nearest points, search the nearest clusters
Recursive application
– to search a cluster, use a Searcher!
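A minimal sketch of this idea (illustrative Python, not Mahout's KmeansSearch): bucket each point under a crude centroid, then answer a query by scanning only the few clusters whose centroids are nearest. Here the "pre-clustering" is just random centroids; the real version clusters first, and each bucket could itself be searched with another Searcher.

```python
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

random.seed(7)
points = [[random.random(), random.random()] for _ in range(5000)]

# Crude pre-clustering: pick some points as centroids and bucket
# every point under its nearest centroid.
centroids = random.sample(points, 50)
clusters = [[] for _ in centroids]
for p in points:
    i = min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
    clusters[i].append(p)

def kmeans_search(query, probe=3):
    # Search only the `probe` clusters whose centroids are nearest the
    # query instead of scanning every point.
    order = sorted(range(len(centroids)),
                   key=lambda j: sq_dist(query, centroids[j]))
    candidates = [p for j in order[:probe] for p in clusters[j]]
    return min(candidates, key=lambda p: sq_dist(query, p))

result = kmeans_search([0.5, 0.5])
```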
- 43. But This Requires k-means!
Need a new k-means algorithm to get speed
– Hadoop is very slow at iterative map-reduce
– Maybe Pregel clones like Giraph would be better
– Or maybe not
Streaming k-means is
– One pass (through the original data)
– Very fast (20 μs per data point with threads on one node)
– Very parallelizable
- 44. Basic Method
Use a single pass of k-means with very many clusters
– output is a bad-ish clustering but a good surrogate
Use weighted centroids from step 1 to do in-memory clustering
– output is a good clustering with fewer clusters
- 45. Algorithmic Details
For each data point xn
– compute the distance to the nearest centroid, δ
– sample u ~ Uniform(0,1); if u > δ/β, add the point to the nearest centroid
– else create a new centroid
If the number of centroids exceeds k log n
– recursively cluster the centroids
– set β = 1.5β if the number of centroids did not decrease
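The loop above can be sketched in Python (illustrative only; the merge rule and the `collapse` helper are simplifications of what Mahout's StreamingKmeans actually does, and squared distance stands in for δ):

```python
import math
import random

def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def collapse(centroids, weights, target):
    # "recursively cluster centroids" -- simplified here to greedily
    # merging the closest pair of centroids until few enough remain
    while len(centroids) > target:
        best = None
        for i in range(len(centroids)):
            for j in range(i + 1, len(centroids)):
                d = sq_dist(centroids[i], centroids[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        wi, wj = weights[i], weights[j]
        centroids[i] = [(wi * a + wj * b) / (wi + wj)
                        for a, b in zip(centroids[i], centroids[j])]
        weights[i] = wi + wj
        del centroids[j], weights[j]
    return centroids, weights

def streaming_kmeans(points, k, beta=1.0):
    centroids, weights = [list(points[0])], [1.0]
    limit = lambda n: max(1, int(k * math.log(n)))
    for n, x in enumerate(points[1:], start=2):
        i = min(range(len(centroids)), key=lambda j: sq_dist(x, centroids[j]))
        d = sq_dist(x, centroids[i])
        if random.random() > d / beta:
            # u > d/beta: fold the point into its nearest centroid
            w = weights[i]
            centroids[i] = [(w * c + xc) / (w + 1)
                            for c, xc in zip(centroids[i], x)]
            weights[i] += 1.0
        else:
            centroids.append(list(x))
            weights.append(1.0)
        if len(centroids) > limit(n):
            before = len(centroids)
            centroids, weights = collapse(centroids, weights, limit(n))
            if len(centroids) >= before:
                beta *= 1.5
    return centroids, weights

random.seed(0)
data = [[random.random(), random.random()] for _ in range(500)]
cents, ws = streaming_kmeans(data, k=3)
```

The weighted centroids that come out are the "bad-ish clustering but good surrogate" of the previous slide: total weight equals the number of points seen, so a second, in-memory pass can cluster them as if they were the data.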
- 46. How It Works
Result is large set of centroids
– these provide approximation of original distribution
– we can cluster centroids to get a close approximation of clustering original
– or we can just use the result directly
- 47. Parallel Speedup?
[Chart: time per point (μs, log scale) versus number of threads (1-20). The non-threaded version sits near 200 μs per point; the threaded version falls from roughly 100 μs at 2 threads to 20-30 μs at 12-16 threads, tracking the perfect-scaling line fairly closely.]
- 50. Warning, Recursive Descent
Inner loop requires finding nearest centroid
With lots of centroids, this is slow
But wait, we have classes to accelerate that!
(Let’s not use k-means searcher, though)
Empirically, projection search beats 64 bit LSH by a bit
– More optimization may change this story
- 51. Moving to Ultra Mega Super Scale
Map-reduce implementation nearly trivial
Map: rough-cluster input data, output β and weighted centroids
Reduce:
– single reducer gets all centroids
– if too many centroids, merge using recursive clustering
– optionally do final clustering in-memory
Combiner possible, but not important
- 52. Contact:
– tdunning@maprtech.com
– @ted_dunning
Slides and such:
– http://info.mapr.com/ted-boston-2012-07
Hash tags: #boston-hug #mahout #mapr