The document discusses the post office problem, which is also known as the nearest neighbor search problem. It describes how k-d trees can be used to solve this problem more efficiently than a naive linear search approach. It explains how k-d trees are built and then searched to find the nearest neighbor point to a query point. It also discusses how the curse of dimensionality can be addressed using the Johnson-Lindenstrauss lemma to project the data into a lower dimensional space to speed up searches while largely preserving relative distances between points.
1. The Post Office Problem
k-d trees, k-nn search, and the Johnson-
Lindenstrauss lemma
2. Who am I?
Jeremy Holland
Senior lead developer at
Centresource
Math and algorithms nerd
@awebneck,
github.com/awebneck, freenode:
awebneck, (you get the idea)
3. If you like the talk...
I like scotch. Just putting it out there.
4. What is the Post Office Problem?
Don Knuth, professional CS badass.
TAOCP, vol. 3
Otherwise known as “Nearest Neighbor search”
5. Let's say you've...
6. But you need to mail a letter!
Which post office do you go to?
7. Finding free images of post
offices is hard, so...
We'll just reduce it to this:
[Diagram: a set of points in the plane, with a query point labeled q]
8. Naive implementation
Calculate the distance from q to every point, and keep the smallest:
min = INFINITY
P = <points to be searched>
K = <dimensionality of points, e.g. 2>
q = <query point>
best = nil
for p in P do
  dimDistSum = 0
  for k in 0...K do
    dimDistSum += (q[k] - p[k])**2
  end
  dist = sqrt(dimDistSum)
  if dist < min
    min = dist
    best = p
  end
end
return best
9. With a little preprocessing...
But that takes time! - can we do better?
You bet!
k-d tree
Binary tree (each node has at most two
children)
Each node represents a single point in the set
to be searched
10. Each node looks like...
Domain: the vector describing the point (i.e.
[p[0], p[1], … p[k-1]])
Range: Some identifying characteristic (e.g. PK
in a database)
Split: A chosen dimension from 0 ≤ split < k
Left: The left child (left.domain[split] <
self.domain[split])
Right: The right child (right.domain[split] ≥
self.domain[split])
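As a concrete illustration, here is a minimal Python sketch of such a node (the class and field names are my own, not from the talk):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class KDNode:
    domain: list                      # the point itself: [p[0], p[1], ..., p[k-1]]
    range: Any                        # identifying payload, e.g. a database PK
    split: int                        # splitting dimension, 0 <= split < k
    left: Optional["KDNode"] = None   # left.domain[split] < self.domain[split]
    right: Optional["KDNode"] = None  # right.domain[split] >= self.domain[split]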
37. How do we search it?
All the way back to the root!
38. How do we search it?
And you have your nearest neighbor, with a good
case of O(log n) running time!
I'm the answer!
39. But that was a pretty good case...
We barely had to backtrack at all – best case is O(log n)
Worst case (lots of backtracking – examining
almost every node) can get up to O(n)
Amount of backtracking is directly proportional
to k!
If k is small (say 2, as in this example) and n is
large, we see a huge improvement over linear
search
As k becomes large, the benefits of this over a
naive implementation virtually disappear!
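To make the backtracking concrete, here is a minimal Python sketch of the standard recursive search (my own illustration, building on the KDNode sketch above – not code from the talk):

import math

def sq_dist(a, b):
    # squared Euclidean distance – no sqrt needed just to compare
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(node, q, best=None, best_dist=math.inf):
    # Recursive k-d tree nearest-neighbor search with backtracking.
    # Returns (best node, squared distance to it).
    if node is None:
        return best, best_dist

    # is this node closer than the best found so far?
    d = sq_dist(node.domain, q)
    if d < best_dist:
        best, best_dist = node, d

    # descend into the side of the splitting plane containing q...
    axis = node.split
    diff = q[axis] - node.domain[axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nearest(near, q, best, best_dist)

    # ...and backtrack into the far side only if the splitting plane
    # is closer than the best candidate found so far
    if diff ** 2 < best_dist:
        best, best_dist = nearest(far, q, best, best_dist)

    return best, best_dist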
40. The Curse of Dimensionality
Curse you, dimensionality!
High-dimensional vector spaces are darned
hard to search!
Why? Too many dimensions! Why are there so
many dimensions!?!
What can we do about it?
Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss
41. It turns out...
Your vectors have a high dimension
You rarely need absolute distances and precise
locations – only the relative distances between points
Relative distance can be largely preserved by a
lower dimensional space
Reduce k dimensions to kproj dimensions,
kproj << k
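For reference, the Johnson–Lindenstrauss lemma (a known result, not spelled out on the slide) makes this precise: for any 0 < ε < 1 and any n points in R^k, there is a linear map f into R^kproj with kproj = O(ε^-2 log n) such that, for every pair of points u, v:

(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert f(u) - f(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2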
49. It turns out...
Relative distance can be largely but not
completely preserved by a lower dimensional
space
Every projection will have errors
How do you choose one with the fewest?
Trick question: Let fate decide!
50. Multiple random projections
Choose the projections randomly
Use multiple projections
Trade resource cost for accuracy
More projections = greater resource cost = greater
accuracy
Fewer projections = lesser resource cost = lesser
accuracy
Trivially parallelizable
Learn to be happy with “good enough”
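A minimal sketch of building one such projection (assuming NumPy and Gaussian entries – a common choice, though the talk doesn't pin down the distribution):

import numpy as np

def random_projection(k, k_proj, rng=None):
    # a k_proj x k matrix of i.i.d. Gaussian entries, scaled so that
    # squared lengths are preserved in expectation
    rng = rng or np.random.default_rng()
    return rng.normal(size=(k_proj, k)) / np.sqrt(k_proj)

def project(R, points):
    # map an (n, k) array of points down to (n, k_proj)
    return points @ R.T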
51. Multiple random projections
Get the nearest neighbor from each projection, then
run a naive nearest-neighbor search on the combined results.
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <q projected into the same subspace as p>
  nns << <nearest neighbor to pq within projection p>
end
nn = <naive nearest neighbor to q among nns>
return nn
Et voilà!
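A runnable Python version of the same idea (my own sketch: for brevity it uses a naive scan inside each projection where the talk would use a k-d tree, and the parameter names are illustrative):

import numpy as np

def naive_nearest(points, q):
    # index of the row of `points` closest to q (squared Euclidean distance)
    return int(np.argmin(((points - q) ** 2).sum(axis=1)))

def multi_projection_nearest(points, q, n_projections=4, k_proj=15, rng=None):
    # points: (n, k) ndarray; q: (k,) ndarray
    # returns the index of the (approximate) nearest point to q
    rng = rng or np.random.default_rng()
    k = points.shape[1]
    candidates = set()
    for _ in range(n_projections):
        # Gaussian random projection from k dimensions down to k_proj
        R = rng.normal(size=(k_proj, k)) / np.sqrt(k_proj)
        candidates.add(naive_nearest(points @ R.T, R @ q))
    # resolve the few surviving candidates exactly, in the original space
    cands = list(candidates)
    return cands[naive_nearest(points[cands], q)]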
52. Multiple random projections
Experiments yield > 98% accuracy when multiple
nearest neighbors are selected from each
projection and the dimensionality is reduced from
256 to 15, at roughly 30% of the computational
cost (see credits)
Additional experiments yielded similar results,
as did my own
That's pretty darn-tootin' good
53. Stuff to watch out for
Balancing is vitally important (assuming a uniform
distribution of points): pay careful attention to
node selection – split on the node with the median
coordinate along the split axis
Cycle through the axes at each level of the tree –
the root should split on dimension 0, level 1 on
dimension 1, level 2 on dimension 2, etc. (see the
sketch below)
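A sketch of a balanced build following exactly that advice (median split, axes cycled by depth); it assumes the KDNode class sketched earlier:

def build_kdtree(points, k, depth=0):
    # Balanced build: split on the median along the current axis,
    # cycling through the axes as we descend (root on 0, level 1 on 1,
    # ..., wrapping around modulo k).
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return KDNode(
        domain=points[median],
        range=None,  # attach an identifying payload (e.g. a PK) here
        split=axis,
        left=build_kdtree(points[:median], k, depth + 1),
        right=build_kdtree(points[median + 1:], k, depth + 1),
    )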
54. Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix
multiplication – time in O(n^2.807) (Strassen's
algorithm)
Building the (balanced) trees from the projections
takes time in approximately O(n log n)
Solution: build the trees ahead of time and
store them for later querying (i.e. index those
bad boys!)
55. Thanks!
Credits:
Based in large part on research conducted by
Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
K-d trees: J. L. Bentley, Stanford U.:
http://bit.ly/Mpy05p
Dimensionality reduction: W. B. Johnson and J.
Lindenstrauss: http://bit.ly/m9SGPN
Research Fuel: Ardbeg Uigeadail:
http://bit.ly/fcag0E