The document discusses the post office problem, which is also known as the nearest neighbor search problem. It describes how k-d trees can be used to solve this problem more efficiently than a naive linear search approach. It explains how k-d trees are built and then searched to find the nearest neighbor point to a query point. It also discusses how the curse of dimensionality can be addressed using the Johnson-Lindenstrauss lemma to project the data into a lower dimensional space to speed up searches while largely preserving relative distances between points.
1. The Post Office Problem
k-d trees, k-nn search, and the Johnson-
Lindenstrauss lemma
2. Who am I?
Jeremy Holland
Senior lead developer at
Centresource
Math and algorithms nerd
@awebneck,
github.com/awebneck, freenode:
awebneck, (you get the idea)
3. If you like the talk...
I like scotch. Just putting it out there.
4. What is the Post Office Problem?
Don Knuth, professional CS badass.
TAOCP, vol. 3
Otherwise known as “Nearest Neighbor search”
5. Let's say you've...
6. But you need to mail a letter!
Which post office do you go to?
7. Finding free images of post
offices is hard, so...
We'll just reduce it to this:
[Diagram: a set of points in the plane, with a query point labeled q]
8. Naive implementation
Calculate the distance from q to every point, and keep the smallest:
min = INFINITY
P = <points to be searched>
K = <dimensionality of points, e.g. 2>
q = <query point>
best = nil
for p in P do
  dimDistSum = 0
  for k in 0...K do
    dimDistSum += (q[k] - p[k])**2
  end
  dist = sqrt(dimDistSum)
  if dist < min
    min = dist
    best = p
  end
end
return best
9. With a little preprocessing...
But that takes time! - can we do better?
You bet!
k-d tree
Binary tree (each node has at most two
children)
Each node represents a single point in the set
to be searched
10. Each node looks like...
Domain: the vector describing the point (i.e.
[p[0], p[1], … p[k-1]])
Range: Some identifying characteristic (e.g. PK
in a database)
Split: A chosen dimension from 0 ≤ split < k
Left: The left child (left.domain[split] <
self.domain[split])
Right: The right child (right.domain[split] ≥
self.domain[split])
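As a concrete illustration, here is a minimal Python sketch of such a node (the class and field names are my own, not from the talk):

from dataclasses import dataclass
from typing import Any, Optional

@dataclass
class KDNode:
    domain: list                      # the point itself: [p[0], p[1], ..., p[k-1]]
    range: Any                        # identifying payload, e.g. a database PK
    split: int                        # splitting dimension, 0 <= split < k
    left: Optional["KDNode"] = None   # left.domain[split] < self.domain[split]
    right: Optional["KDNode"] = None  # right.domain[split] >= self.domain[split]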
37. How do we search it?
All the way back to the root!
38. How do we search it?
And you have your nearest neighbor, with a good
case of O(log n) running time!
I'm the answer!
39. But that was a pretty good case...
We barely had to backtrack at all – best case is O(log n)
Worst case (lots of backtracking – examining
almost every node) can get up to O(n)
Amount of backtracking is directly proportional
to k!
If k is small (say 2, as in this example) and n is
large, we see a huge improvement over linear
search
As k becomes large, the benefits of this over a
naive implementation virtually disappear!
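To make the backtracking concrete, here is a minimal Python sketch of the standard recursive search (my own illustration, building on the KDNode sketch above – not code from the talk):

import math

def sq_dist(a, b):
    # squared Euclidean distance – no sqrt needed just to compare
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def nearest(node, q, best=None, best_dist=math.inf):
    # Recursive k-d tree nearest-neighbor search with backtracking.
    # Returns (best node, squared distance to it).
    if node is None:
        return best, best_dist

    # is this node closer than the best found so far?
    d = sq_dist(node.domain, q)
    if d < best_dist:
        best, best_dist = node, d

    # descend into the side of the splitting plane containing q...
    axis = node.split
    diff = q[axis] - node.domain[axis]
    near, far = (node.left, node.right) if diff < 0 else (node.right, node.left)
    best, best_dist = nearest(near, q, best, best_dist)

    # ...and backtrack into the far side only if the splitting plane
    # is closer than the best candidate found so far
    if diff ** 2 < best_dist:
        best, best_dist = nearest(far, q, best, best_dist)

    return best, best_dist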
40. The Curse of Dimensionality
Curse you, dimensionality!
High-dimensional vector spaces are darned
hard to search!
Why? Too many dimensions! Why are there so
many dimensions!?!
What can we do about it?
Get rid of the extra weight!
Enter Messrs. Johnson and Lindenstrauss
41. It turns out...
Your vectors have a high dimension
You rarely need absolute distances and precise
locations – only the relative distances between points
Relative distance can be largely preserved by a
lower dimensional space
Reduce k dimensions to kproj dimensions,
kproj << k
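For reference, the Johnson–Lindenstrauss lemma (a known result, not spelled out on the slide) makes this precise: for any 0 < ε < 1 and any n points in R^k, there is a linear map f into R^kproj with kproj = O(ε^-2 log n) such that, for every pair of points u, v:

(1 - \varepsilon)\,\lVert u - v \rVert^2 \;\le\; \lVert f(u) - f(v) \rVert^2 \;\le\; (1 + \varepsilon)\,\lVert u - v \rVert^2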
49. It turns out...
Relative distance can be largely but not
completely preserved by a lower dimensional
space
Every projection will have errors
How do you choose one with the fewest?
Trick question: Let fate decide!
50. Multiple random projections
Choose the projections randomly
Use multiple projections
Trade resource cost for accuracy
More projections = greater resource cost = greater
accuracy
Fewer projections = lesser resource cost = lesser
accuracy
Trivially parallelizable
Learn to be happy with “good enough”
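A minimal sketch of building one such projection (assuming NumPy and Gaussian entries – a common choice, though the talk doesn't pin down the distribution):

import numpy as np

def random_projection(k, k_proj, rng=None):
    # a k_proj x k matrix of i.i.d. Gaussian entries, scaled so that
    # squared lengths are preserved in expectation
    rng = rng or np.random.default_rng()
    return rng.normal(size=(k_proj, k)) / np.sqrt(k_proj)

def project(R, points):
    # map an (n, k) array of points down to (n, k_proj)
    return points @ R.T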
51. Multiple random projections
Get the nearest neighbor from each projection, then
run a naive nearest-neighbor search on the combined results.
nns = []
P = <projections>
q = <query point>
for p in P do
  pq = <q projected into the same subspace as p>
  nns << <nearest neighbor to pq within projection p>
end
nn = <naive nearest neighbor to q among nns>
return nn
Et voilà!
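A runnable Python version of the same idea (my own sketch: for brevity it uses a naive scan inside each projection where the talk would use a k-d tree, and the parameter names are illustrative):

import numpy as np

def naive_nearest(points, q):
    # index of the row of `points` closest to q (squared Euclidean distance)
    return int(np.argmin(((points - q) ** 2).sum(axis=1)))

def multi_projection_nearest(points, q, n_projections=4, k_proj=15, rng=None):
    # points: (n, k) ndarray; q: (k,) ndarray
    # returns the index of the (approximate) nearest point to q
    rng = rng or np.random.default_rng()
    k = points.shape[1]
    candidates = set()
    for _ in range(n_projections):
        # Gaussian random projection from k dimensions down to k_proj
        R = rng.normal(size=(k_proj, k)) / np.sqrt(k_proj)
        candidates.add(naive_nearest(points @ R.T, R @ q))
    # resolve the few surviving candidates exactly, in the original space
    cands = list(candidates)
    return cands[naive_nearest(points[cands], q)]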
52. Multiple random projections
Experiments yield > 98% accuracy when multiple
nearest neighbors are selected from each
projection and the dimensionality is reduced from
256 to 15, at roughly 30% of the computational
cost (see credits)
Additional experiments yielded similar results,
as did my own
That's pretty darn-tootin' good
53. Stuff to watch out for
Balancing is vitally important (assuming a uniform
distribution of points): pay careful attention to
node selection – split on the node with the median
coordinate along the split axis
Cycle through the axes at each level of the tree –
the root should split on dimension 0, level 1 on
dimension 1, level 2 on dimension 2, etc. (see the
sketch below)
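A sketch of a balanced build following exactly that advice (median split, axes cycled by depth); it assumes the KDNode class sketched earlier:

def build_kdtree(points, k, depth=0):
    # Balanced build: split on the median along the current axis,
    # cycling through the axes as we descend (root on 0, level 1 on 1,
    # ..., wrapping around modulo k).
    if not points:
        return None
    axis = depth % k
    points = sorted(points, key=lambda p: p[axis])
    median = len(points) // 2
    return KDNode(
        domain=points[median],
        range=None,  # attach an identifying payload (e.g. a PK) here
        split=axis,
        left=build_kdtree(points[:median], k, depth + 1),
        right=build_kdtree(points[median + 1:], k, depth + 1),
    )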
54. Stuff to watch out for
Building the trees still takes some time
Building the projections is effectively matrix
multiplication – time in O(n^2.807) (Strassen's
algorithm)
Building the (balanced) trees from the projections
takes time in approximately O(n log n)
Solution: build the trees ahead of time and
store them for later querying (i.e. index those
bad boys!)
55. Thanks!
Credits:
Based in large part on research conducted by
Yousuf Ahmed, NYU: http://bit.ly/NZ7ZHo
K-d trees: J. L. Bentley, Stanford U.:
http://bit.ly/Mpy05p
Dimensionality reduction: W. B. Johnson and J.
Lindenstrauss: http://bit.ly/m9SGPN
Research Fuel: Ardbeg Uigeadail:
http://bit.ly/fcag0E