7–9. Sparse Regularizations

Synthesis regularization: Coefficients α ↦ Image x = Dα
Analysis regularization: Image x ↦ Correlations α = D*x

Synthesis:  min_{α ∈ R^Q} 1/2 ||y − ΦDα||₂² + λ||α||₁
Analysis:   min_{x ∈ R^N} 1/2 ||y − Φx||₂² + λ||D*x||₁

[Diagram: synthesis builds the image from a few atoms, x = Dα with α sparse; analysis measures the correlations α = D*x and favors (D*x)_i = 0 on most indices, (D*x)_i ≠ 0 on few.]

Unless D is orthogonal, the two regularizations produce different results.
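A minimal numerical sketch of this difference, on a toy denoising problem (Φ = Id) solved with cvxpy; the random dictionary D, the sizes, and λ are illustrative choices, not values from the slides:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(0)
N, Q, lam = 8, 12, 0.5                    # illustrative sizes and lambda
D = rng.standard_normal((N, Q))           # redundant dictionary (Q > N)
y = rng.standard_normal(N)                # observation, Phi = Id

# Synthesis: min_alpha 1/2 ||y - D alpha||^2 + lam ||alpha||_1,  x = D alpha
alpha = cp.Variable(Q)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - D @ alpha)
                       + lam * cp.norm1(alpha))).solve()
x_syn = D @ alpha.value

# Analysis: min_x 1/2 ||y - x||^2 + lam ||D^* x||_1
x = cp.Variable(N)
cp.Problem(cp.Minimize(0.5 * cp.sum_squares(y - x)
                       + lam * cp.norm1(D.T @ x))).solve()
x_ana = x.value

# Nonzero gap in general: D is redundant, hence not orthogonal
print(np.linalg.norm(x_syn - x_ana))
```

For an orthogonal D the two outputs coincide, matching the remark above.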
10–13. Variations and Stability

Observations: y = Φx₀ + w

Recovery:
x_λ(y) ∈ argmin_{x ∈ R^N} 1/2 ||Φx − y||₂² + λ||D*x||₁        (P_λ(y))
x_{0+}(y) ∈ argmin_{Φx = y} ||D*x||₁   (no noise)               (P₀(y))

Questions:
– Behavior of x_λ(y) with respect to y and λ.
  → Application: risk estimation (SURE, GCV, etc.)
– Criteria to ensure ||x_λ(y) − x₀|| ≤ C||w|| (with "reasonable" C).

Synthesis case (D = Id): works of Fuchs and Tropp.
Analysis case: [Nam et al. 2011] for w = 0.
15–17. How to choose the value of the parameter λ?

Risk-based selection of λ:

– Average risk: R(λ) = E_w( ||x_λ(y) − x₀||² ), a measure of the expected quality of x_λ(y) with respect to x₀.
– The optimal (theoretical) λ minimizes the risk; plug-in estimator: λ*(y) = argmin_λ R(λ), x*(y) = x_{λ*(y)}(y).

The risk is unknown since it depends on x₀. But:
– E_w is not accessible → use one observation.
– x₀ is not accessible → needs risk estimators.

Can we estimate the risk solely from x_λ(y)?
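In the simplest setting Φ = Id, D = Id, where x_λ(y) is soft thresholding, Stein's unbiased risk estimate gives exactly such an observable surrogate for R(λ). A sketch; the signal x₀, σ, and the λ grid are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)
N, sigma = 200, 0.5
x0 = np.repeat([0.0, 2.0, 0.0, -1.5], N // 4)    # unknown in practice
y = x0 + sigma * rng.standard_normal(N)

def soft(y, lam):
    return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

lams = np.linspace(0.01, 2.0, 50)
# squared error for this one realization (needs x0, a proxy for R(lambda))
err = [np.sum((soft(y, t) - x0) ** 2) for t in lams]
# SURE(lambda) = ||x_lam - y||^2 + 2 sigma^2 #{|y_i| > lam} - N sigma^2,
# computable from y alone (divergence of soft thresholding = sparsity)
sure = [np.sum((soft(y, t) - y) ** 2)
        + 2 * sigma**2 * np.sum(np.abs(y) > t) - N * sigma**2 for t in lams]
print(lams[np.argmin(sure)], lams[np.argmin(err)])   # typically close picks
```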
28–29. Variations and Stability

Observations: y = Φx₀ + w

Recovery:
x_λ(y) ∈ argmin_{x ∈ R^N} 1/2 ||Φx − y||₂² + λ||D*x||₁        (P_λ(y))

Remark: x_λ(y) is not always unique, but μ_λ(y) = Φx_λ(y) is always unique.

Questions:
– When is y ↦ μ_λ(y) differentiable?
– Formula for ∂μ_λ(y).
30–33. TV-1D Polytope

TV-1D ball (cyclic finite differences, N = 4):

B = {x : ||D*x||₁ ≤ 1},   D* = ( −1  1  0  0
                                  0 −1  1  0
                                  0  0 −1  1
                                  1  0  0 −1 )

Displayed in {x : ⟨x, 1⟩ = 0} ≃ R³.

x_{0+}(y) ∈ argmin_{y = Φx} ||D*x||₁

[Figure: the polytope B inside the zero-mean hyperplane, together with the affine space {x : Φx = y}; the solution sits where this affine space meets the smallest scaled copy of B. © Maël]
34–37. Union of Subspaces Model

x_λ(y) ∈ argmin_{x ∈ R^N} 1/2 ||Φx − y||₂² + λ||D*x||₁        (P_λ(y))

Support of the solution:
I = {i : (D*x_λ(y))_i ≠ 0},   J = I^c

1-D total variation: D*x = (x_i − x_{i−1})_i
[Figure: a piecewise-constant signal x; the support I marks its jumps.]

Sub-space model: G_J = Ker(D*_J) = Im(D_J)^⊥

Local well-posedness: Ker(Φ) ∩ G_J = {0}        (H_J)

Lemma: There exists a solution x* such that (H_J) holds.
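A sketch of these objects in code: extract the support I and cosupport J from a candidate solution, build G_J = Ker(D*_J), and test (H_J). The 1-D difference operator and the random Φ are illustrative instances:

```python
import numpy as np
from scipy.linalg import null_space

rng = np.random.default_rng(2)
N, P = 10, 6
Ds = np.eye(N, k=1)[:N-1] - np.eye(N)[:N-1]   # 1-D finite differences D^*
Phi = rng.standard_normal((P, N))
x = np.repeat([1.0, 3.0], N // 2)             # one jump -> sparse D^* x

corr = Ds @ x
I = np.flatnonzero(np.abs(corr) > 1e-10)      # support of D^* x
J = np.setdiff1d(np.arange(Ds.shape[0]), I)   # cosupport
U = null_space(Ds[J])                         # basis of G_J = Ker(D^*_J)

# (H_J): Ker(Phi) inter G_J = {0}  <=>  Phi restricted to G_J is injective
HJ = np.linalg.matrix_rank(Phi @ U) == U.shape[1]
print(I, U.shape[1], HJ)                      # jump location, dim G_J, (H_J)
```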
38–41. Local Sign Stability

x_λ(y) ∈ argmin_x 1/2 ||Φx − y||₂² + λ||D*x||₁        (P_λ(y))

Lemma: sign(D*x_λ(y)) is constant around (y, λ) ∉ H.
To be understood: there exists a solution with the same sign.

Linearized problem: with s_I = sign(D*_I x_λ(y)),
x̂_λ̄(ȳ) = argmin_{x ∈ G_J} 1/2 ||Φx − ȳ||₂² + λ̄ ⟨D*_I x, s_I⟩
        = A^{[J]} ( Φ*ȳ − λ̄ D_I s_I )
where A^{[J]} z = argmin_{x ∈ G_J} 1/2 ||Φx||₂² − ⟨x, z⟩.

Theorem: If (y, λ) ∉ H, then for (ȳ, λ̄) close to (y, λ), x̂_λ̄(ȳ) is a solution of P_λ̄(ȳ).

[Figure: faces of the TV polytope labeled by dim(G_J).]
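A sketch of the linearized solution, following the definitions above: with U a basis of G_J, the quadratic program defining A^{[J]} solves in closed form as A^{[J]} = U (Uᵀ Φᵀ Φ U)⁻¹ Uᵀ, invertible under (H_J). Helper names are hypothetical:

```python
import numpy as np
from scipy.linalg import null_space

def A_J(Phi, Ds, J):
    """A^[J] as an N x N matrix: A z = argmin_{x in G_J} 1/2||Phi x||^2 - <x, z>."""
    U = null_space(Ds[J])                    # basis of G_J = Ker(D^*_J)
    M = U.T @ Phi.T @ Phi @ U                # invertible under (H_J)
    return U @ np.linalg.solve(M, U.T)

def linearized_solution(Phi, Ds, y, lam, I, sI):
    """xhat_lam(y) = A^[J] (Phi^* y - lam D_I s_I), valid near (y, lam)."""
    J = np.setdiff1d(np.arange(Ds.shape[0]), I)
    return A_J(Phi, Ds, J) @ (Phi.T @ y - lam * Ds[I].T @ sI)
```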
42. Local Affine Maps

Local parameterization: x̂_λ̄(ȳ) = A^{[J]} Φ*ȳ − λ̄ A^{[J]} D_I s_I

Under a uniqueness assumption:
y ↦ x_λ(y)  and  λ ↦ x_λ(y)  are piecewise affine functions.

[Figure: the path λ ↦ x_λ(y), affine between breaking points; each breaking point is a change of support of D*x_λ(y), from x_{0+} at λ = 0 down to x_λ = 0 for λ large.]
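A quick numerical check (a sketch, assuming uniqueness) that λ ↦ x_λ(y) is piecewise affine: solve a tiny analysis problem on a λ grid with cvxpy and look at second differences, which should vanish away from support changes:

```python
import numpy as np
import cvxpy as cp

rng = np.random.default_rng(3)
N, P = 6, 4
Ds = np.eye(N, k=1)[:N-1] - np.eye(N)[:N-1]   # 1-D finite differences
Phi = rng.standard_normal((P, N))
y = rng.standard_normal(P)

def solve(lam):
    x = cp.Variable(N)
    cp.Problem(cp.Minimize(0.5 * cp.sum_squares(Phi @ x - y)
                           + lam * cp.norm1(Ds @ x))).solve()
    return x.value

lams = np.linspace(0.05, 1.0, 40)
X = np.array([solve(t) for t in lams])
d2 = np.abs(np.diff(X, n=2, axis=0)).max(axis=1)
print(np.round(d2, 6))   # ~0 except near breakpoints (support changes)
```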
43–45. Application to GSURE

For y ∉ H, one has locally: μ_λ(y) = Φ A^{[J]} Φ* y + cst.

Corollary: Let I = supp(D*x_λ(y)) be such that (H_J) holds. Then
df_λ(y) = tr(∂μ_λ(y)) = dim(Φ G_J),
df_λ(y) = ||x_λ(y)||₀ for D = Id (synthesis),
gdf_λ(y) = tr(Π A^{[J]})
are unbiased estimators of df and gdf.

Trick: tr(A) = E_z(⟨Az, z⟩), z ∼ N(0, Id_P).

Proposition: gdf_λ(y) = E_z(⟨ν(z), Φ⁺z⟩), z ∼ N(0, Id_P),
where ν(z) solves the linear system

( Φ*Φ   D_J ) ( ν(z) )   ( Φ*z )
( D*_J   0  ) (  ν̃   ) = (  0  )

In practice: gdf_λ(y) ≈ (1/K) Σ_{k=1}^K ⟨ν(z_k), Φ⁺z_k⟩, z_k ∼ N(0, Id_P).
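A sketch of this Monte Carlo estimate on small dense matrices (the slides solve the linear system with a conjugate gradient solver at scale; here a least-squares solve copes with a possibly rank-deficient D_J):

```python
import numpy as np

def gdf_estimate(Phi, Ds, J, K=100, rng=np.random.default_rng(4)):
    """Monte Carlo gdf: average of <nu(z_k), Phi^+ z_k> over K draws."""
    P, N = Phi.shape
    DJ = Ds[J].T                              # D_J, shape N x |J|
    # saddle-point matrix [[Phi^T Phi, D_J], [D_J^T, 0]]
    S = np.block([[Phi.T @ Phi, DJ],
                  [DJ.T, np.zeros((DJ.shape[1], DJ.shape[1]))]])
    Phi_pinv = np.linalg.pinv(Phi)
    acc = 0.0
    for _ in range(K):
        z = rng.standard_normal(P)
        rhs = np.concatenate([Phi.T @ z, np.zeros(DJ.shape[1])])
        nu = np.linalg.lstsq(S, rhs, rcond=None)[0][:N]   # nu(z) block
        acc += nu @ (Phi_pinv @ z)
    return acc / K
```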
46. Compressed-sensing using multi-scale wavelet thresholding

Φ ∈ R^{P×N}: realization of a random vector, P = N/4.
D: translation-invariant redundant wavelet tight frame.

[Figure: quadratic loss as a function of the regularization parameter λ, with the projection risk, GSURE, and true risk curves nearly coinciding; panels show y, x_ML, and x_λ(y) at the optimal λ*.]
47. Super-resolution using (anisotropic) Total-Variation

Φ: vertical sub-sampling.
Finite-differences gradient: D* = [∂₁, ∂₂].

In practice, the empirical mean over the z_k replaces the expectation (law of large numbers), and each ν(z_k) is computed by solving the linear system with a conjugate gradient solver.

[Figure: quadratic loss as a function of λ, with the projection risk, GSURE, and true risk curves; panels show the observations y and x_λ(y) at the optimal λ*.]
48. Overview
• Synthesis vs. Analysis Regularization
• Local Behavior of Sparse Regularization
• Robustness to Noise
• Numerical Illustrations
50–51. Identifiability Criterion

Identifiability criterion of a sign (we suppose (H_J) holds):
IC(s) = min_{u ∈ Ker D_J} ||Ω s_I − u||_∞        (convex → computable)
where Ω = D_J⁺ (Id − Φ*Φ A^{[J]}) D_I.

Discrete 1-D derivative: D*x = (x_i − x_{i−1})_i,   Φ = Id (denoising).
IC(s) = ||ω_J||_∞ with ω_J = Ω s_I and s_I = sign(D*_I x).

[Figure: a piecewise-constant x and its dual vector; ω_J stays strictly inside [−1, +1], hence IC(sign(D*x)) < 1.]
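Since IC(s) is convex, it can be computed as a small linear program: parametrize Ker D_J by a basis B and turn the sup-norm into an auxiliary variable t with |Ω s_I − Bv| ≤ t. A sketch (function names are hypothetical):

```python
import numpy as np
from scipy.linalg import null_space
from scipy.optimize import linprog

def IC(Phi, Ds, I, sI):
    """IC(s) = min over u in Ker D_J of ||Omega s_I - u||_inf, via an LP."""
    N = Phi.shape[1]
    J = np.setdiff1d(np.arange(Ds.shape[0]), I)
    U = null_space(Ds[J])
    A = U @ np.linalg.solve(U.T @ Phi.T @ Phi @ U, U.T)   # A^[J] under (H_J)
    DJ, DI = Ds[J].T, Ds[I].T
    Omega = np.linalg.pinv(DJ) @ (np.eye(N) - Phi.T @ Phi @ A) @ DI
    w = Omega @ sI
    B = null_space(DJ)                  # Ker D_J; often trivial
    if B.size == 0:
        return np.abs(w).max()
    k = B.shape[1]
    # variables [v, t]: minimize t subject to  B v - t <= w,  -B v - t <= -w
    c = np.r_[np.zeros(k), 1.0]
    A_ub = np.block([[ B, -np.ones((len(w), 1))],
                     [-B, -np.ones((len(w), 1))]])
    b_ub = np.r_[w, -w]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub,
                  bounds=[(None, None)] * k + [(0, None)])
    return res.fun
```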
52–55. Robustness to Small Noise

IC(s) = min_{u ∈ Ker D_J} ||Ω s_I − u||_∞,   Ω = D_J⁺ (Id − Φ*Φ A^{[J]}) D_I

Theorem: Suppose IC(sign(D*x₀)) < 1 and let T = min_{i ∈ I} |(D*x₀)_i|.
If ||w||/T is small enough and λ ∼ ||w||, then
x₀ + A^{[J]} Φ*w − λ A^{[J]} D_I s_I
is the unique solution of P_λ(y).

Theorem: If IC(sign(D*x₀)) < 1, then x₀ is the unique solution of P₀(Φx₀).

When D = Id: results of J.J. Fuchs.
56–57. Robustness to Bounded Noise

Robustness criterion:
RC(I) = max_{||p_I||_∞ ≤ 1} min_{u ∈ Ker(D_J)} ||Ω p_I − u||_∞ = max_{||p_I||_∞ ≤ 1} IC(p)

Theorem: If RC(I) < 1 for I = supp(D*x₀), then setting
λ = τ c_J ||w||₂ / (1 − RC(I))   with τ > 1,
P_λ(y) has a unique solution x_λ ∈ G_J, and
||x₀ − x_λ||₂ ≤ C_J ||w||₂.
Constants: c_J = ||D_J⁺ (Φ*Φ A^{[J]} − Id)||_{2,∞},
C_J = ||A^{[J]}||_{2,2} ( ||Φ||_{2,2} + τ c_J ||D_I||_{2,∞} / (1 − RC(I)) ).

When D = Id: results of Tropp (ERC).
59–60. Source Condition

Noiseless necessary and sufficient condition:
x₀ ∈ argmin_{Φx = y} ||D*x||₁  ⟺  ∃α ∈ ∂||·||₁(D*x₀) with Dα ∈ Im(Φ*)        (SC_α)

[Figure: the level set {x : ||D*x||₁ ≤ c} touching the affine space {x : Φx = y} at x₀, with normal direction Dα.]

Theorem: If (H_J), (SC_α) and ||α_J||_∞ < 1, then
||D*(x_λ − x₀)|| = O(||w||)        [Grasmair, Inverse Problems, 2011]

Proposition: Let s = sign(D*x₀) and define α by
α_I = s_I,   α_J = Ω s_I − u*,
where u* attains IC(s) = min_{u ∈ Ker D_J} ||Ω s_I − u||_∞.
Then IC(s) < 1 ⟹ (SC_α) holds and ||α_J||_∞ = IC(s).
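The proposition rests on the standard characterization of the ℓ₁ subdifferential at D*x₀, a routine fact stated here for completeness:

```latex
% alpha is a valid certificate iff it matches the sign pattern on the
% support I and is bounded by 1 in sup-norm on the cosupport J:
\alpha \in \partial \|\cdot\|_1 (D^\ast x_0)
\;\Longleftrightarrow\;
\alpha_I = \operatorname{sign}(D^\ast_I x_0)
\quad\text{and}\quad
\|\alpha_J\|_\infty \le 1 .
```

So the constructed α, with α_I = s_I and ||α_J||_∞ = IC(s) < 1, lies strictly inside the subdifferential, which is exactly the strict condition ||α_J||_∞ < 1 required by the theorem.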
61. Overview
• Synthesis vs. Analysis Regularization
• Local Behavior of Sparse Regularization
• Robustness to Noise
• Numerical Illustrations
62–64. Example: Total Variation in 1-D

Discrete 1-D derivative: D*x = (x_i − x_{i−1})_i.   Denoising: Φ = Id.
{G_J : dim G_J = k} ↔ signals with k − 1 steps.

IC(s) = ||ω_J||_∞,   ω_J = Ω s_I,   s_I = sign(D*_I x)

[Figures: two staircase signals with their dual vectors. Left: ω_J stays strictly inside [−1, +1], so IC(sign(D*x)) < 1 and small-noise robustness holds. Right: ω_J touches ±1, so IC(sign(D*x)) = 1.]
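A usage example of the IC routine sketched earlier, on this 1-D TV denoising setting; the staircase below has two jumps of opposite signs, a configuration for which one typically expects IC < 1:

```python
import numpy as np
# assumes the IC(Phi, Ds, I, sI) function from the linear-program sketch above

N = 12
Phi = np.eye(N)                                  # denoising
Ds = np.eye(N, k=1)[:N-1] - np.eye(N)[:N-1]      # 1-D finite differences
x = np.repeat([0.0, 1.0, -1.0], N // 3)          # two jumps, opposite signs
corr = Ds @ x
I = np.flatnonzero(np.abs(corr) > 1e-10)
print(IC(Phi, Ds, I, np.sign(corr[I])))          # expected: < 1 here
```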
67–68. Example: Total Variation in 2-D

Directional derivatives:
D₁*x = (x_{i,j} − x_{i−1,j})_{i,j} ∈ R^N
D₂*x = (x_{i,j} − x_{i,j−1})_{i,j} ∈ R^N

Gradient: D*x = (D₁*x, D₂*x) ∈ R^{N×2}

Dual vector: ω_I = sign(D*_I x), ω_J = Ω s_I.
[Figures: two images x with their supports I (edge sets), one with IC(s) < 1 and one with IC(s) > 1.]
69–70. Example: Fused Lasso

Total variation and ℓ₁ hybrid:
D*x = ( (x_i − x_{i−1})_i , (ε x_i)_i )
{G_J : dim G_J = k} ↔ sums of k interval indicators.

Compressed sensing: Gaussian Φ ∈ R^{Q×N}.

[Figure: probability P(η, ε, N) of the event IC < 1 as a function of the number of measurements Q, for a signal x₀ made of boxcars of width 2ηN.]
71–72. Example: Invariant Haar Analysis

Haar wavelets:
ψ⁽ʲ⁾_i = 2^{−τ(j+1)} · ( +1 if 0 ≤ i < 2^j;  −1 if −2^j ≤ i < 0;  0 otherwise )

Translation-invariant Haar analysis:
||D*x||₁ = Σ_j ||x ⋆ ψ⁽ʲ⁾||₁,
which can be written as the sum over scales of the TV semi-norms of filtered versions of the signal; as such, it can be understood as a sort of multiscale total variation regularization. Apart from a multiplicative factor, one recovers total variation when J_max = 0.

De-blurring: Φx = φ ⋆ x with φ(t) ∝ e^{−t²/(2σ²)}, a circular convolution with a Gaussian kernel of standard deviation σ (noiseless setting, N = 256). The blurred signal x_η is a centered boxcar with support of size 2ηN:
x_η = 1_{{⌊N/2 − ηN⌋, …, ⌊N/2 + ηN⌋}},   η ∈ (0, 1/2].

[Figure: evolution of IC(sign(D*x₀)) as a function of σ for three dictionaries (τ = 1, τ = 1/2, and total variation), with η = 0.2 fixed.]
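A sketch of the translation-invariant Haar analysis norm, computing the correlations ⟨x, ψ⁽ʲ⁾(· − k)⟩ by circular cross-correlation via the FFT; the default τ = 1 and the test signal are illustrative choices:

```python
import numpy as np

def haar_ti_analysis_norm(x, Jmax, tau=1.0):
    """||D^* x||_1 for the shift-invariant Haar dictionary (cyclic shifts)."""
    N = len(x)
    total = 0.0
    for j in range(Jmax + 1):
        psi = np.zeros(N)
        psi[:2**j] = 1.0                   # +1 on 0 <= i < 2^j
        psi[-2**j:] = -1.0                 # -1 on -2^j <= i < 0 (wrapped)
        psi /= 2.0 ** (tau * (j + 1))
        # circular correlations <x, psi(. - k)> for all shifts k, via FFT
        coeffs = np.fft.ifft(np.fft.fft(x) * np.conj(np.fft.fft(psi))).real
        total += np.abs(coeffs).sum()
    return total

x = np.repeat([0.0, 1.0], 128)             # boxcar-like test signal, N = 256
print(haar_ti_analysis_norm(x, Jmax=3))    # Jmax = 0 reduces to scaled TV
```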
74–76. Conclusion

SURE risk estimation:
Computing risk estimates ⟺ sensitivity analysis.
Open problem: fast algorithms to optimize λ.

Analysis vs. synthesis regularization:
Analysis support is less stable.
Open problem: stability without support stability.