2. Apache Mahout
Now with extra whitening and classification powers!
Thursday, November 4, 2010
3. • Mahout intro
• Scalability in general
• Supervised learning recap
• The new SGD classifiers
4. Mahout?
• Hebrew for “essence”
• Hindi for a guy who drives an elephant
7. Mahout!
• Scalable data-mining and recommendations
• Not all data-mining
• Not the fanciest data-mining
• Just some of the scalable stuff
• Not a competitor for R or Weka
8. General Areas
• Recommendations
• lots of support, lots of flexibility,
production ready
• Unsupervised learning (clustering)
• lots of options, lots of flexibility,
production ready (ish)
9. General Areas
• Supervised learning (classification)
• multiple architectures, fair number of
options, somewhat inter-operable
• production ready (for the right definition
of production and ready)
• Large scale SVD
• larger scale coming, beware sharp edges
10. Scalable?
• Scalable means
• Time is proportional to problem size divided by resource size
• Does not imply Hadoop or parallelism
$t \propto \frac{|P|}{|R|}$, where |P| is the problem size and |R| is the resource size
11. [Figure: wall-clock time against number of training examples for a non-scalable algorithm and a scalable algorithm. Region labels: "Traditional data-mining works here" and "Scalable solutions required". The scalable algorithm is annotated "Mahout wins!"]
12. Scalable means ...
• One unit of work requires about a unit of
time
• Not like the company store (bit.ly/22XVa4)
$t \propto \frac{|P|}{|R|}$
$|P| = O(1) \implies t = O(1)$
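As a worked illustration of the proportionality above, assume a constant of proportionality $c$ and a growth factor $k$ (both symbols are introduced here and are not on the slide). Growing the problem and the resources by the same factor leaves the run time unchanged, with or without Hadoop:

```latex
t = c\,\frac{|P|}{|R|}
\quad\Longrightarrow\quad
t' = c\,\frac{k\,|P|}{k\,|R|} = c\,\frac{|P|}{|R|} = t
```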
13. [Figure: wall-clock time against number of training examples for a sequential algorithm and a parallel algorithm. Region labels: "Sequential algorithm preferred" and "Parallel algorithm preferred".]
15. Training Data Sample
Filled? | x coordinate | y coordinate | shape
yes     | 0.92         | 0.01         | circle
no      | 0.30         | 0.41         | square
(The slide labels the columns as predictor variables and a target variable.)
17. SGD Classification
• Supervised learning of logistic regression
• Stochastic gradient descent (SGD): sequential, not parallel
• Highly optimized for high dimensional
sparse data, possibly with interactions
• Scalable, real dang fast to train
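To make the bullets above concrete, here is a minimal sketch of binary logistic regression trained by stochastic gradient descent on sparse inputs. It is plain Java and not Mahout's implementation: the class name `SgdLogisticSketch`, the index/value array representation, and the fixed learning rate are all assumptions made for illustration, and regularization is left out.

```java
/** Illustrative sketch only; names and representation are hypothetical, not Mahout's. */
final class SgdLogisticSketch {
    private final double[] weights;
    private final double learningRate;

    SgdLogisticSketch(int numFeatures, double learningRate) {
        this.weights = new double[numFeatures];
        this.learningRate = learningRate;
    }

    /** Estimated probability that a sparse instance belongs to class 1. */
    double classify(int[] indices, double[] values) {
        double dot = 0.0;
        for (int i = 0; i < indices.length; i++) {
            dot += weights[indices[i]] * values[i];
        }
        return 1.0 / (1.0 + Math.exp(-dot));
    }

    /** One SGD step; only the non-zero features of the instance are touched. */
    void train(int actual, int[] indices, double[] values) {
        double error = actual - classify(indices, values);
        for (int i = 0; i < indices.length; i++) {
            weights[indices[i]] += learningRate * error * values[i];
        }
    }

    public static void main(String[] args) {
        SgdLogisticSketch model = new SgdLogisticSketch(3, 0.5);
        // Feature 0 is a bias term; feature 1 marks class 1, feature 2 marks class 0.
        int[][] idx = {{0, 1}, {0, 2}};
        double[][] val = {{1.0, 1.0}, {1.0, 1.0}};
        int[] labels = {1, 0};
        for (int epoch = 0; epoch < 200; epoch++) {
            for (int i = 0; i < labels.length; i++) {
                model.train(labels[i], idx[i], val[i]);
            }
        }
        System.out.println(model.classify(idx[0], val[0]));
        System.out.println(model.classify(idx[1], val[1]));
    }
}
```

Each update costs time proportional to the number of non-zero features in the example, which is why a sequential learner can still be fast on high-dimensional sparse data.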
18-20. Supervised Learning
[Figure, built up over three slides: labeled training examples (T, x1 ... xn) feed a learning algorithm that produces a model; the model then assigns a target value T to each unlabeled example (?, x1 ... xn). Slide 19 adds the annotation "Sequential but fast" (the learning step) and slide 20 adds "Stateless, parallel" (applying the model).]
21. Small example
• On 20 newsgroups
• converges in < 10,000 training examples
(less than one pass through the data)
• accuracy comparable to SVM, Naive
Bayes, Complementary Naive Bayes
• learning rate, regularization set
automagically on held-out data
23. Training API
public interface OnlineLearner {
  // actual: index of the target category; instance: the feature vector
  void train(int actual, Vector instance);
  void train(long trackingKey, int actual, Vector instance);
  void train(long trackingKey, String groupKey, int actual, Vector instance);
  // call when training is finished
  void close();
}
24. Classification API
// API summary; method bodies omitted
public class AdaptiveLogisticRegression implements OnlineLearner {
  public AdaptiveLogisticRegression(int numCategories, int numFeatures,
                                    PriorFunction prior);
  public void train(int actual, Vector instance);
  public void train(long trackingKey, int actual, Vector instance);
  public void train(long trackingKey, String groupKey, int actual,
                    Vector instance);
  public void close();
  public double auc();
  public State<Wrapper> getBest();
}

// Usage: learningAlgorithm is a trained AdaptiveLogisticRegression,
// features is the Vector to classify
CrossFoldLearner model = learningAlgorithm.getBest().getPayload().getLearner();
double averageCorrect = model.percentCorrect();
double averageLL = model.logLikelihood();
double p = model.classifyScalar(features);
25. Speed?
• Encoding API for hashed feature vectors
• String, byte[] or double interfaces
• String allows simple parsing
• byte[] and double allow speed
• Abstract interactions supported
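The idea behind hashed feature vectors can be sketched in a few lines of plain Java. This is not Mahout's encoding API: the class `HashedEncoderSketch`, its method names, and the use of `String.hashCode` are assumptions chosen for illustration. Each variable name and value is hashed to an index in a fixed-size vector, so no dictionary is needed, and an interaction is encoded by hashing the pair as one feature.

```java
/** Illustrative hashing-trick encoder; hypothetical names, not Mahout's encoder classes. */
final class HashedEncoderSketch {
    private final int numFeatures;

    HashedEncoderSketch(int numFeatures) {
        this.numFeatures = numFeatures;
    }

    /** Vector index for a (variable name, value) pair; always in [0, numFeatures). */
    int indexFor(String name, String value) {
        return Math.floorMod((name + ":" + value).hashCode(), numFeatures);
    }

    /** Categorical or word-like variable: add 1 at the hashed location. */
    void addCategorical(String name, String value, double[] vector) {
        vector[indexFor(name, value)] += 1.0;
    }

    /** Continuous variable: the location depends on the name only, the value is the weight. */
    void addContinuous(String name, double value, double[] vector) {
        vector[Math.floorMod(name.hashCode(), numFeatures)] += value;
    }

    /** Interaction of two categorical variables, hashed as a single combined feature. */
    void addInteraction(String nameA, String valueA,
                        String nameB, String valueB, double[] vector) {
        vector[indexFor(nameA + "*" + nameB, valueA + "*" + valueB)] += 1.0;
    }
}
```

Collisions are possible and simply add up in the same slot; a larger vector makes them rarer at the cost of memory.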
26. Speed!
• Parsing and encoding dominate the run time of a single learner
• Moderate optimization allows 1 million training examples with 200 features to be encoded in 14 seconds on a single core
• 20 million mixed text, categorical features
with many interactions learned in ~ 1 hour
27. More Speed!
• Evolutionary optimization of learning
parameters allows simple operation
• 20x threading allows high machine use
• 20 newsgroup test completes in less time
on single node with SGD than on Hadoop
with Complementary Naive Bayes
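A rough sketch of what evolutionary tuning of a learning parameter can look like, in plain Java. This is a simple keep-the-best-and-mutate scheme, not the algorithm Mahout uses. The class `EvolutionaryTunerSketch`, the population sizes, and the mutation width are assumptions, and the `fitness` function stands in for whatever held-out score (for example log-likelihood) a real learner would report.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;
import java.util.function.DoubleUnaryOperator;

/** Illustrative elitist search over one positive parameter; hypothetical, not Mahout's. */
final class EvolutionaryTunerSketch {
    static double tune(DoubleUnaryOperator fitness, int generations, long seed) {
        Random random = new Random(seed);
        int populationSize = 20;
        int survivors = 5;
        Double[] population = new Double[populationSize];
        for (int i = 0; i < populationSize; i++) {
            // Log-uniform initial candidates between 1e-4 and 1.
            population[i] = Math.pow(10.0, -4.0 * random.nextDouble());
        }
        // Higher fitness sorts first. Fitness is re-evaluated during sorting,
        // which is acceptable for a sketch but wasteful for an expensive score.
        Comparator<Double> bestFirst =
            Comparator.comparingDouble((Double x) -> -fitness.applyAsDouble(x));
        for (int g = 0; g < generations; g++) {
            Arrays.sort(population, bestFirst);
            // Keep the best candidates; replace the rest with mutated copies of them.
            for (int i = survivors; i < populationSize; i++) {
                double parent = population[random.nextInt(survivors)];
                population[i] = parent * Math.exp(0.5 * random.nextGaussian());
            }
        }
        Arrays.sort(population, bestFirst);
        return population[0];
    }
}
```

Because the best candidates always survive, the best score never gets worse from one generation to the next, and the user does not have to pick the parameter by hand.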
28. Summary
• Mahout provides early production-quality
scalable data-mining
• New classification systems allow industrial
scale classification