News From Mahout

©MapR Technologies - Confidential

whoami – Ted Dunning
 Chief Application Architect, MapR Technologies
 Committer, member, Apache Software Foundation
– particularly Mahout, Zookeeper and Drill
(we’re hiring)
 Contact me at
tdunning@maprtech.com
tdunning@apache.com
ted.dunning@gmail.com
@ted_dunning

 Slides and such (available late tonight):
– http://www.mapr.com/company/events/nyhug-03-05-2013
 Hash tags: #mapr #nyhug #mahout

New in Mahout
 0.8 is coming soon (1-2 months)
 gobs of fixes
 QR decomposition is 10x faster
– makes ALS 2-3 times faster
 May include Bayesian Bandits
 Super fast k-means
– fast
– online (!?!)

 Possible new edition of Mahout in Action (MiA) coming
– Japanese and Korean editions released, Chinese coming

Real-time Learning

We have a product to sell … from a web-site

Bogus Dog Food is the Best!
Now available in handy 1 ton bags!
Buy 5!
What picture?
What tag-line?
What call to action?

The Challenge
 Design decisions affect probability of success
– Cheesy web-sites don’t even sell cheese
 The best designers do better when allowed to fail
– Exploration juices creativity
 But failing is expensive
– If only because we could have succeeded
– But also because offending or disappointing customers is bad

More Challenges
 Too many designs
– 5 pictures
– 10 tag-lines
– 4 calls to action
– 3 background colors
=> 5 x 10 x 4 x 3 = 600 designs
 It gets worse quickly
– What about changes on the back-end?
– Search engine variants?
– Checkout process variants?
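The 600 figure is a plain Cartesian product of the front-end choices. A minimal sketch, assuming placeholder names for each element (only the counts 5, 10, 4 and 3 come from the slide):

```python
from itertools import product

# Placeholder design elements; only the counts come from the slide.
pictures = [f"picture_{i}" for i in range(5)]
tag_lines = [f"tag_line_{i}" for i in range(10)]
calls_to_action = [f"call_{i}" for i in range(4)]
background_colors = [f"color_{i}" for i in range(3)]

# Every combination of one picture, one tag-line, one call to action, one color.
designs = list(product(pictures, tag_lines, calls_to_action, background_colors))
print(len(designs))  # 600
```

Each extra variant dimension (search engine, checkout process) multiplies this count again, which is why exhaustive testing gets out of hand.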

Example – AB testing in real-time
 I have 15 versions of my landing page
 Each visitor is assigned to a version
– Which version?
 A conversion or sale or whatever can happen
– How long to wait?
 Some versions of the landing page are horrible
– Don’t want to give them traffic

A Quick Diversion
 You see a coin
– What is the probability of heads?
– Could it be larger or smaller than that?
 I flip the coin and while it is in the air ask again
 I catch the coin and ask again
 I look at the coin (and you don’t) and ask again
 Why does the answer change?
– And did it ever have a single value?

A Philosophical Conclusion
 Probability as expressed by humans is subjective and depends on
information and experience

I Dunno

5 heads out of 10 throws

2 heads out of 12 throws

So now you understand Bayesian probability
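One way to make the coin slides concrete is to summarize each state of knowledge as a Beta distribution. The sketch below assumes a uniform Beta(1, 1) prior (the same +1 counts that appear in the R code later in the deck) and treats "I Dunno" as zero throws, which is an interpretation rather than something the slides state.

```python
from math import sqrt

def beta_posterior(heads, throws):
    """Posterior over P(heads) after the observed throws, assuming a uniform Beta(1, 1) prior."""
    a = heads + 1
    b = throws - heads + 1
    mean = a / (a + b)
    std = sqrt(a * b / ((a + b) ** 2 * (a + b + 1)))
    return a, b, mean, std

# "I Dunno" (no data yet), then the two experiments from the slides.
for heads, throws in [(0, 0), (5, 10), (2, 12)]:
    a, b, mean, std = beta_posterior(heads, throws)
    print(f"{heads}/{throws}: Beta({a}, {b}), mean {mean:.3f}, std {std:.3f}")
```

The spread shrinks as throws accumulate, and the mean moves toward the observed frequency: the answer depends on the information available.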

Another Quick Diversion
 Let’s play a shell game
 This is a special shell game
 It costs you nothing to play
 The pea has constant probability of being under each shell
(trust me)
 How do you find the best shell?
 How do you find it while maximizing the number of wins?

Pause for short con-game

Interim Thoughts
 Can you identify winners or losers without trying them out?
 Can you ever completely eliminate a shell with a bad streak?
 Should you keep trying apparent losers?

So now you understand multi-armed bandits

Conclusions
 Can you identify winners or losers without trying them out?
No
 Can you ever completely eliminate a shell with a bad streak?
No
 Should you keep trying apparent losers?
Yes, but at a decreasing rate

Is there an optimum strategy?

Bayesian Bandit
 Compute distributions based on data so far
 Sample p1, p2 and p3 from these distributions
 Pick shell i where i = argmax_i p_i
 Lemma 1: The probability of picking shell i will match the
probability it is the best shell
 Lemma 2: This is as good as it gets
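The steps above can be sketched in a few lines. This is an illustrative simulation, not Mahout code: the true win probabilities, the step count and the seed are made up, and a uniform Beta(1, 1) prior is assumed for every shell.

```python
import random

def select(counts, rng):
    """Sample a win probability for each shell from its Beta posterior; pick the largest."""
    samples = [rng.betavariate(wins + 1, losses + 1) for wins, losses in counts]
    return max(range(len(samples)), key=samples.__getitem__)

def run(true_p, steps, seed=0):
    rng = random.Random(seed)
    counts = [[0, 0] for _ in true_p]     # [wins, losses] per shell
    best = max(true_p)
    regret = 0.0
    for _ in range(steps):
        i = select(counts, rng)
        win = rng.random() < true_p[i]    # play the chosen shell
        counts[i][0 if win else 1] += 1   # learning is just counting
        regret += best - true_p[i]        # expected loss versus always playing the best shell
    return counts, regret

# Hypothetical shells with win probabilities 0.1, 0.5 and 0.9.
counts, regret = run(true_p=[0.1, 0.5, 0.9], steps=2000)
pulls = [wins + losses for wins, losses in counts]
print(pulls, round(regret, 1))
```

Apparent losers are still tried now and then, but less and less often as their posteriors tighten.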

And it works!
[Figure: regret (y axis, 0 to 0.12) versus n (x axis, up to about 1000) for ε-greedy with ε = 0.05 and for the Bayesian Bandit with Gamma-Normal]

Video Demo

The Code
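 Select an alternative
select = function(k) {
  # k[i,1] counts losses and k[i,2] counts wins for alternative i
  n = dim(k)[1]
  p0 = rep(0, length.out=n)
  for (i in 1:n) {
    # sample a plausible win rate from the Beta posterior
    p0[i] = rbeta(1, k[i,2]+1, k[i,1]+1)
  }
  return (which.max(p0))
}
 Select and learn
# the wrapper name and signature are assumed; the slide shows only the body
learn = function(k, steps) {
  for (z in 1:steps) {
    i = select(k)
    j = test(i)    # test() runs the trial; returns column 1 (loss) or 2 (win)
    k[i,j] = k[i,j]+1
  }
  return (k)
}
 But we already know how to count!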

The Basic Idea
 We can encode a distribution by sampling
 Sampling allows unification of exploration and exploitation
 Can be extended to more general response models

The Original Problem
[Figure: the Bogus Dog Food ad with its design choices labeled x1, x2 and x3]

Response Function
$p(\text{win}) = w\left(\sum_i \theta_i x_i\right)$
[Figure: plot of y against x, with x from -6 to 6 and y from 0 to 1]
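A small sketch of this response function. It assumes that w is the logistic function; the slide does not name it, though the plot's axes (x from -6 to 6, y from 0 to 1) are consistent with that choice. The weights and the design encoding are hypothetical.

```python
from math import exp

def w(t):
    """Logistic link (an assumption; the slide does not say which function w is)."""
    return 1.0 / (1.0 + exp(-t))

def p_win(theta, x):
    """p(win) = w(sum_i theta_i * x_i)."""
    return w(sum(t_i * x_i for t_i, x_i in zip(theta, x)))

theta = [0.8, -0.3, 1.5]   # hypothetical weights
x = [1, 0, 1]              # hypothetical encoding of one design
print(p_win(theta, x))
```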

Generalized Banditry
 Suppose we have an infinite number of bandits
– suppose they are each labeled by two real numbers x and y in [0,1]
– also that expected payoff is a parameterized function of x and y
– now assume a distribution for θ that we can learn online
 Selection works by sampling θ, then computing f
 Learning works by propagating updates back to θ
– If f is linear, this is very easy
– For special other kinds of f it isn’t too hard
 Not limited to two labels; could have labels and context
$E[z] = f(x, y \mid \theta)$
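For the easy linear case, selection and learning can be sketched as follows. Everything specific here is an assumption for illustration: f(x, y | θ) = θ1 x + θ2 y, Gaussian noise with known variance, a N(0, I) prior on θ, a grid of candidate (x, y) labels, and the made-up `true_theta`.

```python
import random
from math import sqrt

class LinearBandit:
    """Gaussian posterior over theta for f(x, y | theta) = theta[0]*x + theta[1]*y."""

    def __init__(self, noise_var=0.25):
        self.noise_var = noise_var
        self.A = [[1.0, 0.0], [0.0, 1.0]]   # posterior precision; starts at the N(0, I) prior
        self.b = [0.0, 0.0]

    def posterior(self):
        (a, c), (_, d) = self.A
        det = a * d - c * c
        cov = [[d / det, -c / det], [-c / det, a / det]]   # inverse of the 2x2 precision
        mean = [cov[0][0] * self.b[0] + cov[0][1] * self.b[1],
                cov[1][0] * self.b[0] + cov[1][1] * self.b[1]]
        return mean, cov

    def sample_theta(self, rng):
        mean, cov = self.posterior()
        l11 = sqrt(cov[0][0])                               # 2x2 Cholesky factor of cov
        l21 = cov[0][1] / l11
        l22 = sqrt(max(cov[1][1] - l21 * l21, 0.0))
        g1, g2 = rng.gauss(0, 1), rng.gauss(0, 1)
        return [mean[0] + l11 * g1, mean[1] + l21 * g1 + l22 * g2]

    def select(self, candidates, rng):
        """Sample theta, then pick the candidate with the largest f under that sample."""
        theta = self.sample_theta(rng)
        return max(candidates, key=lambda p: theta[0] * p[0] + theta[1] * p[1])

    def learn(self, point, z):
        """Propagate one observed payoff z back to the posterior on theta."""
        x, y = point
        self.A[0][0] += x * x / self.noise_var
        self.A[0][1] += x * y / self.noise_var
        self.A[1][0] += x * y / self.noise_var
        self.A[1][1] += y * y / self.noise_var
        self.b[0] += x * z / self.noise_var
        self.b[1] += y * z / self.noise_var

rng = random.Random(1)
true_theta = [1.0, -0.5]   # hypothetical; unknown to the bandit
grid = [(i / 10, j / 10) for i in range(11) for j in range(11)]
bandit = LinearBandit()
for _ in range(500):
    x, y = bandit.select(grid, rng)
    z = true_theta[0] * x + true_theta[1] * y + rng.gauss(0, 0.5)
    bandit.learn((x, y), z)
mean, _ = bandit.posterior()
print(mean)
```

The grid stands in for the infinite family of bandits: the sampled θ scores any (x, y), so nothing has to be stored per bandit.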

Context Variables
[Figure: the Bogus Dog Food ad with its design choices labeled x1, x2 and x3, plus the context variables user.geo, env.time, env.day_of_week and env.weekend]

Caveats
 Original Bayesian Bandit only requires real-time
 Generalized Bandit may require access to long history for learning
– Pseudo online learning may be easier than true online
 Bandit variables can include content, time of day, day of week
 Context variables can include user id, user features
 Bandit × context variables provide the real power
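One simple reading of "bandit × context" is feature crossing: the model sees each bandit variable, each context variable, and every pairing of the two. The sketch below is illustrative; the variable values are made up, and only the context names user.geo and env.weekend come from the slides.

```python
def cross_features(bandit_vars, context_vars):
    """Indicator features for each bandit variable, each context variable, and every pair."""
    features = {}
    for name, value in bandit_vars.items():
        features[f"{name}={value}"] = 1.0
    for name, value in context_vars.items():
        features[f"{name}={value}"] = 1.0
    for b_name, b_value in bandit_vars.items():
        for c_name, c_value in context_vars.items():
            features[f"{b_name}={b_value}&{c_name}={c_value}"] = 1.0
    return features

features = cross_features(
    {"picture": "dog", "tag_line": "best"},      # hypothetical design choices
    {"user.geo": "NY", "env.weekend": True},     # context names from the slides, values made up
)
print(len(features))  # 2 + 2 + 2*2 = 8
```

The crossed terms let the model learn that a design works for one kind of visitor and not another, which neither set of variables can express alone.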

You can do this yourself!

Super-fast k-means Clustering

Rationale

What is Quality?
 Robust clustering not a goal
– we don’t care if the same clustering is replicated
 Generalization is critical
 Agreement with a “gold standard” is a non-issue

An Example

Diagonalized Cluster Proximity

Clusters as Distribution Surrogate

Theory

For Example
Grouping these two clusters seriously hurts squared distance
$\Delta_4^2(X) > \frac{1}{\sigma^2} \Delta_5^2(X)$

Algorithms

Typical k-means Failure
Selecting two seeds here cannot be fixed with Lloyd's algorithm
Result is that these two clusters get glued together

Ball k-means
 Provably better for highly clusterable data
 Tries to find initial centroids in the “core” of each real cluster
 Avoids outliers in centroid computation
initialize centroids randomly with distance maximizing tendency
for each of a very few iterations:
    for each data point:
        assign point to nearest cluster
    recompute centroids using only points much closer than closest cluster
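The pseudocode above could look roughly like this in Python. Two details are assumptions, not taken from the slide: seeding samples points with probability proportional to squared distance from the seeds chosen so far (one reading of "distance maximizing tendency"), and "much closer" means within `trim` (here 1/3) of the distance from a centroid to its closest other centroid. It is not the Mahout implementation.

```python
import random

def dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def seed(points, k, rng):
    """Pick seeds with probability proportional to squared distance from the chosen seeds."""
    centroids = [rng.choice(points)]
    while len(centroids) < k:
        weights = [min(dist2(p, c) for c in centroids) for p in points]
        centroids.append(rng.choices(points, weights=weights, k=1)[0])
    return centroids

def ball_kmeans(points, k, iterations=3, trim=1 / 3, seed_value=0):
    """Sketch only; assumes k >= 2 and distinct points."""
    rng = random.Random(seed_value)
    centroids = seed(points, k, rng)
    for _ in range(iterations):
        balls = [[] for _ in range(k)]
        for p in points:
            d = [dist2(p, c) for c in centroids]
            i = min(range(k), key=d.__getitem__)          # nearest cluster
            # squared distance from this centroid to its closest other centroid
            gap2 = min(dist2(centroids[i], c) for j, c in enumerate(centroids) if j != i)
            if d[i] <= (trim ** 2) * gap2:                # keep only points in the core
                balls[i].append(p)
        for i, ball in enumerate(balls):
            if ball:                                      # leave a centroid alone if its ball is empty
                centroids[i] = [sum(col) / len(ball) for col in zip(*ball)]
    return centroids

# Toy data: two well separated blobs around (0, 0) and (10, 10).
data_rng = random.Random(42)
points = ([[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(100)]
          + [[data_rng.gauss(10, 0.5), data_rng.gauss(10, 0.5)] for _ in range(100)])
centroids = sorted(ball_kmeans(points, k=2))
print(centroids)
```

Trimming to the ball keeps outliers from dragging a centroid away from the core of its cluster.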

Still Not a Win
 Ball k-means seeding is nearly guaranteed to succeed with k = 2
 Probability of successful seeding drops exponentially with k
 Alternative strategy has high probability of success, but takes
O(nkd + k^3 d) time

 But for big data, k gets large

Surrogate Method
 Start with sloppy clustering into lots of clusters
κ = k log n clusters
 Use this sketch as a weighted surrogate for the data
 Results are provably good for highly clusterable data

Algorithm Costs
 Surrogate methods
– fast, sloppy single pass clustering with κ = k log n
– fast sloppy search for nearest cluster,
O(d log κ) = O(d (log k + log log n)) per point
– fast, in-memory, high-quality clustering of κ weighted centroids
O(κ k d + k^3 d) = O(k^2 d log n + k^3 d) for small k, high quality
O(κ d log k) or O(d log k (log k + log log n)) for larger k, looser quality
– result is k high-quality centroids
• For many purposes, even the sloppy surrogate may suffice

Algorithm Costs
 How much faster for the sketch phase?
– take k = 2000, d = 10, n = 100,000
– k d log n = 2000 x 10 x 26 = 520,000 ≈ 500,000
– d (log k + log log n) = 10(11 + 5) = 160
– roughly 3,000 times faster is a bona fide big deal
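The comparison can be recomputed directly. The sketch assumes base-2 logarithms. Under that assumption the slide's log n = 26 corresponds to n near 10^8, not 100,000, so both values are tried.

```python
from math import log2

def per_point_costs(k, d, n):
    """Per-point work: brute force over kappa = k log n centroids versus a log-time search."""
    brute_force = k * d * log2(n)
    fast_search = d * (log2(k) + log2(log2(n)))
    return brute_force, fast_search

ratios = {}
for n in (100_000, 100_000_000):
    brute, fast = per_point_costs(k=2000, d=10, n=n)
    ratios[n] = brute / fast
    print(n, round(brute), round(fast), round(ratios[n]))
```

Either way the speedup is in the low thousands, and it is insensitive to n because n only enters through log n and log log n.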

How It Works
 For each point
– Find approximately nearest centroid (distance = d)
– If (d > threshold) new centroid
– Else if (u < d/threshold) new centroid, with u drawn uniformly from [0, 1)
– Else add to nearest centroid
 If centroids > κ ≈ C log N
– Recursively cluster centroids with higher threshold
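A compact sketch of this single pass. It simplifies in ways the slide does not: the nearest centroid is found exactly rather than approximately, the threshold grows by an arbitrary factor of 1.5 on each collapse, and the new-centroid test is written as u < d/threshold, so that far points are the ones likely to start a centroid. That single test also covers d > threshold, since the ratio then exceeds every possible u.

```python
import random

def dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def sketch(points, max_centroids, threshold, rng):
    """One pass over weighted points [(vector, weight)]; returns weighted centroids."""
    centroids = []                                      # each entry is [vector, weight]
    for vec, w in points:
        if not centroids:
            centroids.append([list(vec), w])
            continue
        nearest = min(centroids, key=lambda c: dist(c[0], vec))   # exact search for simplicity
        d = dist(nearest[0], vec)
        if rng.random() < d / threshold:                # far points usually start a new centroid
            centroids.append([list(vec), w])
        else:                                           # otherwise fold into the nearest centroid
            total = nearest[1] + w
            nearest[0] = [(c * nearest[1] + x * w) / total for c, x in zip(nearest[0], vec)]
            nearest[1] = total
        if len(centroids) > max_centroids:
            # too many centroids: raise the threshold and cluster the centroids themselves
            threshold *= 1.5
            centroids = sketch([(c, cw) for c, cw in centroids], max_centroids, threshold, rng)
    return centroids

# Toy data: three blobs of 200 unit-weight points each.
rng = random.Random(7)
data = [([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)], 1)
        for cx, cy in [(0, 0), (20, 0), (0, 20)] for _ in range(200)]
rng.shuffle(data)
centroids = sketch(data, max_centroids=30, threshold=0.5, rng=rng)
print(len(centroids), sum(w for _, w in centroids))
```

The centroid weights always sum to the number of points seen, which is what lets the sketch act as a weighted surrogate for the data.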

Implementation

But Wait, …
 Finding nearest centroid is inner loop
 This could take O( d κ ) per point and κ can be big
 Happily, approximate nearest centroid works fine

Projection Search
total ordering!
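One way to read this slide: projecting every vector onto a random direction yields a total ordering, so a binary search finds the query's neighbors in that ordering, and a few such orderings produce a short candidate list to check exactly. The class below is a sketch of that idea with made-up parameter defaults (`n_projections`, `window`), not the Mahout searcher.

```python
import bisect
import random

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

class ProjectionSearch:
    def __init__(self, vectors, n_projections=4, window=4, seed=0):
        rng = random.Random(seed)
        dim = len(vectors[0])
        self.vectors = vectors
        self.window = window
        self.projections = []
        for _ in range(n_projections):
            direction = [rng.gauss(0, 1) for _ in range(dim)]
            # sorting by projected value gives a total ordering of the vectors
            order = sorted((dot(direction, v), i) for i, v in enumerate(vectors))
            self.projections.append((direction, order))

    def nearest(self, query):
        """Index of the approximately nearest vector to the query."""
        candidates = set()
        for direction, order in self.projections:
            pos = bisect.bisect_left(order, (dot(direction, query),))
            lo = max(0, pos - self.window)
            hi = min(len(order), pos + self.window)
            candidates.update(i for _, i in order[lo:hi])
        # exact distances are computed for the few candidates only
        return min(candidates,
                   key=lambda i: sum((a - b) ** 2 for a, b in zip(self.vectors[i], query)))

rng = random.Random(5)
vectors = [[rng.gauss(0, 1) for _ in range(10)] for _ in range(500)]
search = ProjectionSearch(vectors)
print(search.nearest(vectors[17]))
```

Each query costs a few binary searches plus a handful of full distance computations, instead of one distance per centroid.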

LSH Bit-match Versus Cosine
[Figure: plot with x axis from 0 to 64 and y axis from -1 to 1]
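The relationship in the title can be reproduced with random-hyperplane hashing: each of 64 bits records which side of a random hyperplane a vector falls on, and for such hashes the expected fraction of matching bits between two vectors is 1 - angle/π. The sketch below uses made-up random data. The 64-bit width is chosen to match the plot's 0 to 64 axis, and whether the talk used exactly this scheme is an assumption.

```python
import random
from math import sqrt

def signature(vec, planes):
    """One bit per random hyperplane: which side of the plane the vector falls on."""
    return [sum(p * x for p, x in zip(plane, vec)) >= 0 for plane in planes]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

rng = random.Random(3)
dim, bits = 10, 64
planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(bits)]

pairs = []   # (number of matching bits, true cosine) for random vector pairs
for _ in range(200):
    a = [rng.gauss(0, 1) for _ in range(dim)]
    b = [rng.gauss(0, 1) for _ in range(dim)]
    matches = sum(x == y for x, y in zip(signature(a, planes), signature(b, planes)))
    pairs.append((matches, cosine(a, b)))

print(pairs[:5])
```

Bit matching is cheap to compute and tracks cosine well enough to shortlist candidates before an exact comparison.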

Results

Parallel Speedup?
[Figure: time per point (μs) versus number of threads, comparing the threaded version, the non-threaded version and perfect scaling]

Quality
 Ball k-means implementation appears significantly better than
simple k-means
 Streaming k-means + ball k-means appears to be about as good as
ball k-means alone
 All evaluations on 20 newsgroups with held-out data
 Figure of merit is mean and median squared distance to nearest
cluster
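The figure of merit is simple to compute. A minimal sketch with toy numbers (not the 20 newsgroups data):

```python
from statistics import mean, median

def squared_distance_to_nearest(point, centroids):
    return min(sum((x - c) ** 2 for x, c in zip(point, centroid)) for centroid in centroids)

def figure_of_merit(held_out, centroids):
    """Mean and median squared distance from held-out points to their nearest centroid."""
    d2 = [squared_distance_to_nearest(p, centroids) for p in held_out]
    return mean(d2), median(d2)

centroids = [[0.0, 0.0], [10.0, 10.0]]               # toy centroids
held_out = [[1.0, 0.0], [0.0, 2.0], [10.0, 13.0]]    # toy held-out points
print(figure_of_merit(held_out, centroids))
```

Scoring on held-out points measures generalization, in line with the earlier quality criteria.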

Contact Me!
 We’re hiring at MapR in US and Europe
 MapR software available for research use
 Get the code as part of Mahout trunk (or 0.8 very soon)
 Contact me at tdunning@maprtech.com or @ted_dunning
 Share news with @apachemahout
