1. Sparse Matrix Reconstruction
Michael Hankin
University of Southern California
mhankin@usc.edu
December 5, 2013
3. Overview of Matrix Completion Problem
Motivation: Say that Netflix has $N_{\text{Movies}}$ movies and $N_{\text{Users}}$ users. Given universal knowledge they could construct an $N_{\text{Movies}} \times N_{\text{Users}}$ matrix of ratings and thus predict which movies their users would enjoy, and how much so.
However, all they have are the few ratings their users have taken the time
to input, and the data on which accounts have watched which movies.
Can the full matrix be reconstructed from this VERY sparse, noisy
sample?
4. Overview of Matrix Completion Problem
Idea: Without some constraint, the values of the missing points could be any real (or even complex!) number. Obviously we have to impose some restrictions, beginning with real numbers only!
Less obvious is the condition that the matrix be of low rank. In the Netflix problem this is natural: there really aren't that many types of people (as far as taste profiles go) or movies (as far as genre/appeal profiles go). However, this condition is relevant in many other scenarios as well.
5. Notational Interlude
For a matrix $X$ define the nuclear norm to be
$$\|X\|_* = \sum_{i=1}^{r} |\sigma_i|$$
where the $\sigma_i$'s are the singular values of the matrix and $r$ is its rank (and therefore the number of nonzero singular values).
Grievously abusing notation, we might say $\|X\|_* = \|\sigma\|_1$.
If the true matrix is $M$ and we observe only $M_{i,j}\ \forall (i,j) \in \Omega$ for some $\Omega$, then let
$$[P_\Omega(X)]_{i,j} = \begin{cases} X_{i,j} & (i,j) \in \Omega \\ 0 & (i,j) \notin \Omega \end{cases}$$
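In code these two definitions are tiny. A minimal numpy sketch, assuming $\Omega$ is represented as a boolean mask (the names nuclear_norm and P_omega are mine):

    import numpy as np

    def nuclear_norm(X):
        # ||X||_* = sum of singular values (the |.| is moot: singular values are >= 0).
        return np.linalg.svd(X, compute_uv=False).sum()

    def P_omega(X, mask):
        # Keep X[i, j] where (i, j) is in Omega (mask True), zero elsewhere.
        return np.where(mask, X, 0.0)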
6. Problem statement
Given $P_\Omega(M)$ and the knowledge that $M$ is of low rank, recover $M$.
To do so we work with an approximation $X$. We want to minimize the rank of $X$; however, a direct approach would be NP-hard, in the same way that minimizing $\|\sigma\|_0$ would be, so we relax our conditions in the same vein as LASSO, and set up the problem:
$$\min \|X\|_* \approx \min \|\sigma\|_1 \quad \text{s.t.}\quad P_\Omega(M) = P_\Omega(X) \tag{1}$$
We end up with something slightly resembling the Dantzig selector, which we know gives sparse results, and sparsity in $\sigma$ is equivalent to low rank for $X$.
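Problem (1) is convex, so before turning to a specialized algorithm it can be handed to an off-the-shelf solver. A sketch using CVXPY on toy data, purely to make the relaxation concrete (the setup is mine, not from the slides; for realistic sizes this generic approach is far too slow):

    import cvxpy as cp
    import numpy as np

    # Toy data: a rank-2 matrix observed on roughly half its entries.
    rng = np.random.default_rng(0)
    M = rng.standard_normal((30, 2)) @ rng.standard_normal((2, 30))
    mask = (rng.random(M.shape) < 0.5).astype(float)  # 1.0 on Omega, 0.0 elsewhere

    X = cp.Variable(M.shape)
    problem = cp.Problem(cp.Minimize(cp.normNuc(X)),        # min ||X||_*
                         [cp.multiply(mask, X - M) == 0])   # P_Omega(X) = P_Omega(M)
    problem.solve()
    print(np.linalg.matrix_rank(X.value, tol=1e-6))         # low rank in practice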
7. To expand on this notion, consider the Dantzig case.
Level sets of $\|\cdot\|_1$ can be visually represented as a diamond.¹ (Figure omitted.)
In the case of LASSO regression, the ball $\|\beta\|_1 \le 1$ can be considered to be the convex hull of the unit-length Euclidean basis vectors and their negatives.
In the nuclear norm case, $\|X\|_* \le 1$ is just the convex hull of the set of rank-1 matrices whose spectral norm satisfies $\|X\|_2 \le 1$ (keeping in mind that $\|X\|_2 = \|\sigma\|_\infty$).
The solution to the previous minimization problem is the point at which the smallest level set of the nuclear norm to intersect the subspace $\{X : P_\Omega(M) = P_\Omega(X)\}$ does so. Using the spatial intuition gleaned from our study of LASSO, we recognize that this will give a sparse set of singular values, and therefore a low-rank matrix, that agrees with $M$ on all of $\Omega$.
¹ Credit to Nicolai Meinshausen: http://www.stats.ox.ac.uk/~meinshau/
8. Algorithm Background
Cai, Candès, and Shen introduced an algorithm that comes close to solving our problem.
Let $X$ be of low rank $r$ and $U\Sigma V^*$ be its SVD, where $\Sigma = \mathrm{diag}(\{\sigma_i\}_{1\le i\le r})$ (because it has only $r$ nonzero singular values).
Next they define the soft-thresholding operator:
$$D_\tau(X) = U D_\tau(\Sigma) V^*, \qquad D_\tau(\Sigma) = \mathrm{diag}\left(\{(\sigma_i - \tau)_+\}_{1\le i\le r}\right)$$
for $\tau > 0$, so that it shrinks all of the singular values of $X$, setting any that were originally $\le \tau$ to 0, thereby reducing its rank.
Note: $D_\tau(X) = \arg\min_Y \frac{1}{2}\|Y - X\|_F^2 + \tau\|Y\|_*$. This will affect the output of the algorithm.
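As a sanity check on the definition, a minimal numpy sketch of $D_\tau$ (svt_shrink is my name for it; it assumes a dense SVD is affordable):

    import numpy as np

    def svt_shrink(X, tau):
        # Singular value soft-thresholding: D_tau(X) = U diag((sigma_i - tau)_+) V*.
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        s = np.maximum(s - tau, 0.0)   # shrink every singular value, zeroing those <= tau
        return (U * s) @ Vt            # equivalent to U @ np.diag(s) @ Vt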
9. Algorithm
Start with some $Y^0$ that vanishes outside of $\Omega$ (an efficient choice for $Y^0$ will be discussed later, but for now just use 0 or even $P_\Omega(M)$).
Choose values for $\tau > 0$ and a sequence $\delta_k$ of step sizes.
At step $k$, set $X^k = D_\tau(Y^{k-1})$.
Then set $Y^k = Y^{k-1} + \delta_k P_\Omega(M - X^k)$.
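Putting the steps together, a compact sketch of the whole iteration, under my own simplifying assumptions of dense SVDs, a fixed step size $\delta$, and the relative-residual stopping rule discussed later:

    import numpy as np

    def svt_complete(M_obs, mask, tau, delta, max_iter=500, tol=1e-4):
        """SVT sketch: M_obs holds P_Omega(M) (zeros off Omega); mask is True on Omega."""
        Y = np.zeros_like(M_obs)                      # Y^0 = 0 vanishes outside Omega
        X = Y
        norm_obs = np.linalg.norm(M_obs)
        for _ in range(max_iter):
            U, s, Vt = np.linalg.svd(Y, full_matrices=False)
            X = (U * np.maximum(s - tau, 0.0)) @ Vt   # X^k = D_tau(Y^{k-1})
            residual = np.where(mask, M_obs - X, 0.0) # P_Omega(M - X^k)
            if np.linalg.norm(residual) <= tol * norm_obs:
                break
            Y = Y + delta * residual                  # Y^k = Y^{k-1} + delta * P_Omega(M - X^k)
        return X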
10. Algorithm Discussion
Notes on the algorithm:
Low Rank
The $X^k$'s will tend to have low rank, unless too many of the singular values end up growing beyond $\tau$, so that further iterations do not lower the rank. Both the authors and I found (empirically) that the rank of the $X^k$'s tends to start low and grow to a stable point after a few dozen iterations. As long as the original matrix was of low rank, this stable point also tends to be of low rank. Unfortunately the authors have been unable to prove this.
When the dimensions of $X^k$ are high, this low-rank property allows us to economize on memory by maintaining only the portion of its SVD corresponding to nonzero singular values instead of the entire, dense matrix itself.
11. Algorithm Discussion
Notes on the algorithm:
Sparsity
The $Y^k$'s will always be sparse, and vanish outside of $\Omega$. This is obvious because we require that $Y^0$ be either equal to 0 or at least vanish outside of $\Omega$. $P_\Omega(M - X^k)$ vanishes outside of $\Omega$ by definition, and if we assume $Y^{k-1}$ does too, then $Y^k = Y^{k-1} + \delta_k P_\Omega(M - X^k)$ must have the same property, and is therefore sparse.
This lessens our storage requirements (though we must still maintain the dense matrices $X^k$), but more importantly it makes computing the SVD of $Y^k$ much faster, as long as clever computational approaches and a sparse solver are used.
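To make the computational point concrete, a sketch of taking a truncated SVD of a large sparse matrix with SciPy; the sizes, density, and choice of k here are mine:

    import numpy as np
    from scipy.sparse import random as sparse_random
    from scipy.sparse.linalg import svds

    # A large sparse Y^k: only ~1% of entries are nonzero, so it is cheap to store.
    Y = sparse_random(2000, 2000, density=0.01, format="csr", random_state=0)
    U, s, Vt = svds(Y, k=20)   # leading 20 singular triplets, without densifying Y
    print(s[::-1])             # svds returns singular values in ascending order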
12. Proof that the algorithm gives a solution to
$$\min_X \ \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t.}\quad P_\Omega(M) = P_\Omega(X) \tag{2}$$
13. Convergence Significance
Figure: Convergence towards the true value for different $\tau$ and $\delta$ values
14. Convergence Significance
Figure: Convergence towards the true value for different $\tau$ and $\delta$ values
15. Convergence Significance
As seen in the proof, the algorithm converges to the solution of:
$$\min_X \ \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t.}\quad P_\Omega(M) = P_\Omega(X) \tag{3}$$
16. Convergence Significance
Why is a solution to
$$\min_X \ \tau\|X\|_* + \tfrac{1}{2}\|X\|_F^2 \quad \text{s.t.}\quad P_\Omega(M) = P_\Omega(X) \tag{4}$$
satisfactory when we're looking for a solution to
$$\min_X \ \|X\|_* \quad \text{s.t.}\quad P_\Omega(M) = P_\Omega(X) \tag{5}$$
17. Proof of adequacy in a more general case. (The key point, proved by Cai, Candès, and Shen, is that as $\tau \to \infty$ the solutions of (4) converge to the solution of (5) of minimal Frobenius norm, so for large $\tau$ the two problems effectively agree.)
18. Convergence Significance
Figure: Convergence towards the true value for different $\tau$ and $\delta$ values
19. General Convex Constraints
Cai, Candès, and Shen extend their algorithm to the more general case, addressed in the previous proof:
$$\min f_\tau(X) \quad \text{s.t.}\quad f_i(X) \le 0 \ \forall i \tag{6}$$
where the $f_i(X)$'s are convex, lower semi-continuous functionals.
20. Generalized Algorithm
In that case, the algorithm is as follows:
Denote $\mathcal{F}(X) = (f_1(X), \ldots, f_n(X))$ and initialize $y^0$.
$X^k = \arg\min_X \ f_\tau(X) + \langle y^{k-1}, \mathcal{F}(X)\rangle$
$y^k = (y^{k-1} + \delta_k \mathcal{F}(X^k))_+$
In the special case where the constraints are linear, i.e. $\mathcal{A}(X) \le b$ for some linear functional $\mathcal{A}$, the iterations are as follows:
$X^k = D_\tau(\mathcal{A}^*(y^{k-1}))$
$y^k = (y^{k-1} + \delta_k (b - \mathcal{A}(X^k)))_+$
Consider $b = \{M_{i,j}\}_{(i,j)\in\Omega}$, $\mathcal{A}(X) = \{X_{i,j}\}_{(i,j)\in\Omega}$, and its adjoint $\mathcal{A}^*(y)$ mapping $y$ to a sparse matrix with entries only on indices in $\Omega$ and values equal to those in $y$; a sketch of this pair follows.
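A minimal numpy sketch of this $\mathcal{A}$, $\mathcal{A}^*$ pair, with $\Omega$ stored as index arrays (the names and the adjoint check are mine):

    import numpy as np

    def A(X, rows, cols):
        # Sample the entries of X on Omega, returned as a vector.
        return X[rows, cols]

    def A_adjoint(y, rows, cols, shape):
        # Scatter y back onto a matrix supported only on Omega.
        X = np.zeros(shape)
        X[rows, cols] = y
        return X

    # Adjoint check: <A(X), y> should equal <X, A*(y)>.
    rng = np.random.default_rng(0)
    X = rng.standard_normal((5, 4))
    rows, cols = np.array([0, 2, 4]), np.array([1, 3, 0])
    y = rng.standard_normal(3)
    assert np.isclose(A(X, rows, cols) @ y,
                      np.sum(X * A_adjoint(y, rows, cols, X.shape)))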
21. Use Case
Noise!
If our data is noisy we can instead impose $|X_{i,j} - M_{i,j}| < \epsilon \ \forall (i,j) \in \Omega$. This is a special case of the general convex constraints above.
Example
Triangulation: If the matrix in question holds the distances between points, we can fill in the relative locations from just a few entries, as in the sketch below.
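One reason triangulation fits the low-rank framework: a matrix of squared Euclidean distances between points in $\mathbb{R}^d$ has rank at most $d+2$, so it is low-rank whenever $d$ is small. A quick numpy check of that fact (the setup is mine, not from the slides):

    import numpy as np

    rng = np.random.default_rng(0)
    pts = rng.standard_normal((50, 2))                 # 50 points in R^2
    sq = np.sum(pts**2, axis=1)
    D = sq[:, None] + sq[None, :] - 2 * pts @ pts.T    # squared distances
    print(np.linalg.matrix_rank(D))                    # at most d + 2 = 4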
22. Noise Free
Figure: Rank-10 matrix
23. Noisy
Figure: Rank-10 matrix with a little noise, using exact matrix reconstruction
26. Stopping Criteria
Because we expect $P_\Omega(M - X)$ to converge to zero, the authors suggest using $\|P_\Omega(M - X)\|_F \,/\, \|P_\Omega(M)\|_F \le \epsilon$ as a stopping criterion. Because I generated my own data, I can actually plot $\|M - X\|_F \,/\, \|M\|_F$.
Figure: Rank-10 matrix with no noise
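For reference, the two quantities as a small numpy sketch (the function names are mine):

    import numpy as np

    def observed_rel_error(M_obs, X, mask):
        # ||P_Omega(M - X)||_F / ||P_Omega(M)||_F -- usable without knowing the full M.
        return np.linalg.norm(np.where(mask, M_obs - X, 0.0)) / np.linalg.norm(M_obs)

    def true_rel_error(M, X):
        # ||M - X||_F / ||M||_F -- only available with synthetic data.
        return np.linalg.norm(M - X) / np.linalg.norm(M)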
27. WORK IN PROGRESS
When can a matrix be reconstructed, and how much data is required? The most obvious issues arise when either a row or a column of $P_\Omega(M)$ is all 0. In that case nothing can be done, as that row (or column) could be totally independent of the others.
Along those lines, if any row or column in the unshredded $M$ is all 0, we are out of luck, as $P_\Omega(M)$ must then also have a 0 row (or column). Even when there are no such rows or columns in $M$, if any of its singular vectors are too heavily skewed in a Euclidean basis direction, the likelihood of one of the rows (or columns) of $P_\Omega(M)$ being 0 is high. Also, note that an $n_1 \times n_2$ matrix of rank $r$ has $(n_1 - r)r + r^2 + (n_2 - r)r$ degrees of freedom.
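For a sense of scale, plugging in $n_1 = n_2 = 1000$ and $r = 10$ gives
$$(1000 - 10)\cdot 10 + 10^2 + (1000 - 10)\cdot 10 = 19{,}900$$
degrees of freedom, versus $10^6$ entries in the full matrix, so observing only a few percent of the entries is at least information-theoretically plausible.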
28. References
Cai, J.-F., Candès, E. J. and Shen, Z. (2010)
A singular value thresholding algorithm for matrix completion.
SIAM J. Optim. 20, 1956-1982.
Candès, E. J. and Plan, Y. (2010)
Matrix completion with noise.
Proceedings of the IEEE 98, 925-936.