29. Many applications
• Dimensionality reduction
• Sketching
• Approximate nearest neighbors
• Random projection trees
• Kernel approximations
• Newton sketches for optimization
• Linear programming
• …
Googling for "sketching" was the best idea :-)
31. Sparse random projections
…
Achlioptas. "Database-friendly random projections: Johnson-Lindenstrauss with binary coins." JCSS, 2003.
Li et al. "Very Sparse Random Projections." KDD, 2006.
Dasgupta et al. "A sparse Johnson-Lindenstrauss transform." STOC, 2010.
33. The Fast-JLT transform
Φ = P H D, with P a sparse projection, H the Walsh-Hadamard matrix, and D a diagonal matrix of random signs
Ailon and Chazelle. "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform." STOC, 2006.
34. Orthogonal random features
Orthogonalize: G = QR
Rescale: G_ORF = (1/σ) S Q, with S ∼ diag(χ_d)
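A minimal numpy sketch of these two steps (the function name and the Gaussian-kernel bandwidth argument sigma are illustrative):

import numpy as np

def orthogonal_random_features(d, sigma, rng=np.random.default_rng(0)):
    # Orthogonalize a Gaussian matrix: G = QR
    G = rng.standard_normal((d, d))
    Q, _ = np.linalg.qr(G)
    # Rescale with S ~ diag(chi_d): chi-distributed factors restore the
    # row norms an i.i.d. Gaussian matrix would have had
    S = np.sqrt(rng.chisquare(df=d, size=d))
    return (S[:, None] * Q) / sigma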
35. Structured orthogonal random features
First used for practical spherical LSH
G_SORF = (√d / σ) H D₁ H D₂ H D₃, with H the Walsh-Hadamard matrix and Dᵢ diagonal matrices of random signs
Andoni et al. "Practical and optimal LSH for angular distance." NeurIPS, 2015.
Choromanski et al. "The unreasonable effectiveness of structured random orthogonal embeddings." NeurIPS, 2017.
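A sketch of the G_SORF formula above, assuming d is a power of two (an explicit scipy Hadamard matrix is used for clarity; fast implementations use the Walsh-Hadamard transform instead of a matrix product):

import numpy as np
from scipy.linalg import hadamard

def sorf_matrix(d, sigma, rng=np.random.default_rng(0)):
    H = hadamard(d) / np.sqrt(d)  # normalized Walsh-Hadamard matrix
    M = np.eye(d)
    # Three HD_i blocks; per the note later in this deck, two are not sufficient
    for _ in range(3):
        D = np.diag(rng.choice([-1.0, 1.0], size=d))
        M = H @ D @ M
    return (np.sqrt(d) / sigma) * M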
36. LSH: Locality Sensitive Hashing
Define a family H of hash functions such that near points collide with higher probability than far ones:
• Build a hash function g from h₁, …, h_k ∈ H^k, e.g. h(x) = sgn(⟨a, x⟩) with aᵢ ∼ N(0,1)
• Use L independent hash tables
Indyk and Motwani. "Approximate nearest neighbors: towards removing the curse of dimensionality." STOC, 1998.
Charikar. "Similarity estimation techniques from rounding algorithms." STOC, 2002.
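A toy index following the slide's construction; the class and method names are illustrative, and a real system would rerank the returned candidates exactly:

import numpy as np
from collections import defaultdict

class SignLSH:
    def __init__(self, d, k, L, rng=np.random.default_rng(0)):
        # L tables, each keyed by k sign bits h(x) = sgn(<a, x>)
        self.planes = [rng.standard_normal((k, d)) for _ in range(L)]
        self.tables = [defaultdict(list) for _ in range(L)]

    def _key(self, A, x):
        return tuple((A @ x > 0).astype(np.int8))

    def insert(self, item_id, x):
        for A, table in zip(self.planes, self.tables):
            table[self._key(A, x)].append(item_id)

    def candidates(self, x):
        # Union of the L buckets x falls into
        out = set()
        for A, table in zip(self.planes, self.tables):
            out.update(table[self._key(A, x)])
        return out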
Retargeting:
A user browses an e-commerce website
Then moves on to a publisher website
Criteo buys ad placements
Criteo is paid if the ad is clicked
For each user and each client, compute offline recommendations with different algorithms. Append all these "sources" to the last historical products, and you have a short list of products to score online, where the probability of a click can be estimated with logistic regression.
One technique is to compute a vector space for products, so that nearest neighbors between products can be computed.
The classical way to compute these vectors is to factorize the interaction matrix, usually through a singular value decomposition (SVD). This is called collaborative filtering; a minimal sketch follows.
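A minimal sketch of this pipeline, assuming a sparse user x product interaction matrix (shapes and the embedding size are illustrative):

import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import svds

# Toy user x product interaction matrix (e.g. clicks or sales counts)
X = sparse_random(1000, 500, density=0.01, format="csr", random_state=0)

# Truncated SVD: product embeddings live in the right singular vectors
U, s, Vt = svds(X, k=32)
norms = np.linalg.norm(Vt.T, axis=1, keepdims=True) + 1e-12
V = Vt.T / norms  # one unit vector per product

# Nearest neighbors of a product by cosine similarity
# (the product itself comes first; 42 is a hypothetical product id)
neighbors = np.argsort(-(V @ V[42]))[:10]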
1. Intuition: there exists a transform with low distortion into O(log(n)) dimensions (independent of d!)
2. Get drawings and formulas from https://scikit-learn.org/stable/auto_examples/plot_johnson_lindenstrauss_bound.html
=> 3-4 slides
The ideal setting is when n is large, and of course d > log(n)
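A quick sanity check of this intuition with scikit-learn (the data here is random, just for illustration):

import numpy as np
from sklearn.random_projection import (GaussianRandomProjection,
                                       johnson_lindenstrauss_min_dim)

n, d = 1000, 10000
X = np.random.default_rng(0).standard_normal((n, d))

# The target dimension depends only on n and eps, not on d
k = johnson_lindenstrauss_min_dim(n_samples=n, eps=0.2)
Xp = GaussianRandomProjection(n_components=k, random_state=0).fit_transform(X)

# Pairwise distances are preserved up to a factor 1 +/- eps
i, j = 3, 7
print(np.linalg.norm(X[i] - X[j]), np.linalg.norm(Xp[i] - Xp[j]))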
First idea: use a sparse projection matrix with ±1 entries
Successive improvements followed
But there are issues with sparse inputs (see the sketch after the references below)
Achlioptas. "Database-friendly random projections: Johnson-Lindenstrauss with binary coins." JCSS, 2003.
Li et al. "Very Sparse Random Projections." KDD, 2006.
Dasgupta et al. "A sparse Johnson-Lindenstrauss transform." STOC, 2010.
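scikit-learn implements these constructions directly; a minimal sketch:

import numpy as np
from sklearn.random_projection import SparseRandomProjection

X = np.random.default_rng(0).standard_normal((1000, 10000))

# density=1/3 reproduces Achlioptas' binary-coin projections; the default
# density 1/sqrt(d) gives the very sparse variant of Li et al.
proj = SparseRandomProjection(n_components=256, density=1 / 3, random_state=0)
Xp = proj.fit_transform(X)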
Fourier transform idea: by the uncertainty principle, the data and its spectrum cannot both be sparse => work on the spectrum.
Randomize the selection of Hadamard rows. Re-randomize with random signs to ensure non-sparsity.
Now P can be sparse Gaussian as in the original P H D construction, but results have been improved since, and a simple coordinate-sampling matrix is enough (see the sketch below).
Ailon and Chazelle. "Approximate nearest neighbors and the fast Johnson-Lindenstrauss transform." STOC, 2006.
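A compact sketch of the transform with the coordinate-sampling P described above, assuming d is a power of two (an explicit Hadamard matrix stands in for the fast transform):

import numpy as np
from scipy.linalg import hadamard

def fjlt(x, k, rng=np.random.default_rng(0)):
    d = x.shape[0]                              # must be a power of two
    signs = rng.choice([-1.0, 1.0], size=d)     # D: random sign flips
    H = hadamard(d) / np.sqrt(d)                # H: normalized Walsh-Hadamard
    z = H @ (signs * x)                         # H D x densifies the spectrum
    idx = rng.choice(d, size=k, replace=False)  # P: coordinate sampling
    return np.sqrt(d / k) * z[idx]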
A step further is taken by Clarkson & Woodruff, "Low rank approximation and regression in input sparsity time", STOC 2013, where the sampling matrix is essentially a CountSketch. However, it no longer has the JL properties.
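A sketch of that embedding: each row of A is hashed to one of k buckets and given a random sign, so computing S A costs O(nnz(A)):

import numpy as np

def countsketch(A, k, rng=np.random.default_rng(0)):
    n = A.shape[0]
    buckets = rng.integers(0, k, size=n)        # hash each row to a bucket
    signs = rng.choice([-1.0, 1.0], size=n)     # random signs
    SA = np.zeros((k, A.shape[1]))
    np.add.at(SA, buckets, signs[:, None] * A)  # unbuffered scatter-add
    return SA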
These are not faster, but have better properties:
Originally, the JL proof used orthogonal features; they were dropped by Indyk et al. for LSH
Recently shown to yield lower-variance kernel estimators with RFF (see the later section on RFF)
While we currently do not know how to prove rigorously that such pseudo-random rotations perform as well as fully random ones, empirical evaluations show that three applications of HDᵢ are equivalent to applying a true random rotation as d tends to infinity. We note that two applications of HDᵢ are not sufficient.
A well-known application is the approximate nearest neighbors problem, which you might find useful in real life.
This is a very nice application to kernel approximation.
https://www.youtube.com/watch?v=Qi1Yry33TQE
More modern applications
3 more examples coming
SGD everywhere.
Bandit algorithms & how children learn
The exploration-exploitation trade-off exists everywhere, e.g. in scientific research.