The document discusses representation learning and the manifold hypothesis. It explains that real-world data can be thought of as concentrating near a lower-dimensional manifold embedded within a high-dimensional space. Representation learning involves modeling the structure of this data-supporting manifold. The geometric notion of manifold provides an important perspective for representation learning, with the goal of learning an intrinsic coordinate system on the embedded manifold.
7. • Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288. (Google Scholar citations: 21305)
• Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289-1306. (citations: 19534)
• Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607. (citations: 4765)
• Candes, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 4203-4215. (citations: 5488)
• Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2313-2351. (citations: 2603)
• Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489-509. (citations: 12285)
20. • Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8), 1798-1828.
Another important perspective on representation learning is based on the geometric notion of manifold. Its premise is the manifold hypothesis, according to which real-world data presented in high-dimensional spaces are expected to concentrate in the vicinity of a manifold $\mathcal{M}$ of much lower dimensionality $d_{\mathcal{M}}$, embedded in high-dimensional input space $\mathbb{R}^{d_x}$. This prior seems particularly well suited for AI tasks such as those involving images, sounds, or text, for which most uniformly sampled input configurations are unlike natural stimuli. As soon as there is a notion of "representation," one can think of a manifold by considering the variations in input space which are captured by or reflected (by corresponding changes) in the learned representation. To first approximation, some directions are well preserved (the tangent directions of the manifold), while others are not (directions orthogonal to the manifolds). With this perspective, the primary unsupervised learning task is then seen as modeling the structure of the data-supporting manifold. The associated representation being learned can be associated with an intrinsic coordinate system on the embedded manifold. The archetypal manifold modeling algorithm is, not surprisingly, also the archetypal low-
Bengio et al. (2013)
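What is a manifold? (An intuitive explanation)
• A shape that looks different on the surface but can, in essence, be described by a d-dimensional Euclidean space
• Equivalently, "a shape on which a map can be drawn locally" (example: the surface of the Earth)
Figure: a one-dimensional manifold embedded in three dimensions, and likewise a two-dimensional manifold (the "Swiss roll")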
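The "Swiss roll" can be made concrete with a small sketch. The parametrization below is one common choice and an assumption of this example, not something taken from the slides; the function names `swiss_roll` and `intrinsic_coords` are hypothetical. It shows that two intrinsic coordinates are enough to describe every point of the surface, even though the points live in three dimensions.

```python
import math

def swiss_roll(t, h):
    """Map intrinsic coordinates (t, h) to a point in 3-D space.

    Assumed parametrization: t is the position along the spiral,
    h is the position along the axis of the roll.
    """
    return (t * math.cos(t), h, t * math.sin(t))

def intrinsic_coords(point):
    """Recover (t, h) from a 3-D point on the roll (valid for t >= 0)."""
    x, y, z = point
    return (math.hypot(x, z), y)

p = swiss_roll(4.0, 1.5)          # a point in R^3
t, h = intrinsic_coords(p)        # its 2-D "map" coordinates
print(p, (t, h))
```

Recovering `(t, h)` from the 3-D point is exactly the "local map" in the intuitive description: a learned representation would ideally play the role of `intrinsic_coords`.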
22. • Least-squares residual $\|y - Ax\|^2$ as a function of $x$
• Penalized form:
$$\hat{x} = \arg\min_{x} \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\| \tag{4}$$
• Constrained form:
$$\hat{x} = \arg\min_{x} \frac{1}{2}\|y - Ax\|^2 \tag{5}$$
$$\text{subject to } \|x\| \le k \tag{6}$$
24. • $\|\beta\|_0$: the $L_0$ norm of $\beta$, i.e. its number of nonzero entries
$$\min_{\beta}\; 2\sum_{i=1}^{n} l(y_i, x_i^\top \beta) + 2\|\beta\|_0 \tag{1}$$
$$\frac{1}{n}\,\mathbb{E}_{\{x_i,y_i\}_{i=1}^{n}}\left[\,2\sum_{i=1}^{n} l(y_i, x_i^\top \hat\beta) + 2\|\hat\beta\|_0\right] \tag{2}$$
$$= \mathbb{E}_{\{x_i,y_i\}_{i=1}^{n}}\,\mathbb{E}_{X,Y}\left[\,2\,l(Y, X^\top \hat\beta)\right] + O\!\left(\frac{1}{n^2}\right) \tag{3}$$
Date: 2017-09-21
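25. • In the Bayesian setting, the posterior probability is proportional to (probability of the observed data) × (prior probability).
• We want the parameter $\eta$ that maximizes the posterior probability.
• Taking logarithms gives the following interpretation:
$$\hat\eta = \arg\max_{\eta} P(X \mid \eta)\,P(\eta \mid \lambda) = \arg\max_{\eta}\left[\log P(X \mid \eta) + \log P(\eta \mid \lambda)\right]$$
Here $\lambda$ is the hyperparameter of the prior. The first term corresponds to the loss function and the second to the regularization term.
• Example: Gaussian prior and Gaussian likelihood give a regularization term based on the L2 norm. Assume $y - \langle w, \phi(x)\rangle \sim N(0, 1)$, so that
$$\log p(y \mid x, w) = \log N\big(y \mid \langle w, \phi(x)\rangle, 1\big) = -\frac{1}{2}\big(y - \langle w, \phi(x)\rangle\big)^2 + \text{const.}$$
Take the prior on the weights to be $p(w \mid 0, \lambda^{-1}) = N(0, \lambda^{-1} I)$ and treat it in the same way, so that $\log p(w \mid 0, \lambda^{-1}) = -\frac{\lambda}{2} w^\top w + \text{const.}$ Then
$$\hat w = \arg\min_{w}\Big[-\sum_i \log p(y_i \mid x_i, w) - \log p(w \mid 0, \lambda^{-1})\Big] = \arg\min_{w} \sum_i \frac{1}{2}\big(y_i - \langle w, \phi(x_i)\rangle\big)^2 + \frac{\lambda}{2}\, w^\top w$$
$\lambda^{-1}$ can also be read as the variance of the prior on $w$.
• Example: Laplace prior and Gaussian likelihood give a regularization term based on the L1 norm. With the same likelihood and a Laplace prior with mean 0, $p(w \mid 0, \lambda) = \prod_j \frac{\lambda}{4}\exp\!\big(-\frac{\lambda}{2}|w_j|\big)$, the same steps give
$$\hat w = \arg\min_{w}\Big[-\sum_i \log p(y_i \mid x_i, w) - \log p(w \mid 0, \lambda)\Big] = \arg\min_{w} \sum_i \frac{1}{2}\big(y_i - \langle w, \phi(x_i)\rangle\big)^2 + \frac{\lambda}{2}\,\|w\|_1$$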
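The difference between the two priors is easiest to see with a single scalar weight, where both MAP estimates have a closed form. The sketch below is an illustration under assumptions that are not in the slides: a scalar model $y_i \approx w x_i$ with $\phi(x) = x$, penalties written as $\frac{\lambda}{2} w^2$ and $\frac{\lambda}{2}|w|$, and hypothetical function names.

```python
def soft_threshold(z, t):
    """Shrink z toward zero by t; return 0.0 when |z| <= t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def map_gaussian_prior(xs, ys, lam):
    """argmin_w 1/2 * sum (y_i - w*x_i)^2 + lam/2 * w^2  (Gaussian prior, L2)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + lam)

def map_laplace_prior(xs, ys, lam):
    """argmin_w 1/2 * sum (y_i - w*x_i)^2 + lam/2 * |w|  (Laplace prior, L1)."""
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return soft_threshold(sxy, lam / 2.0) / sxx

xs = [1.0, 2.0, 3.0]
ys = [0.1, -0.1, 0.2]
w_l2 = map_gaussian_prior(xs, ys, lam=2.0)   # shrunk, but nonzero
w_l1 = map_laplace_prior(xs, ys, lam=2.0)    # exactly zero
print(w_l2, w_l1)
```

With weakly informative data the Gaussian prior only shrinks the estimate, while the Laplace prior sets it exactly to zero. That is the sparsity-inducing behaviour of L1 regularization.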
27. • Constrained formulation:
$$\hat{x} = \arg\min_{x} \frac{1}{2}\|y - Ax\|^2 \quad \text{subject to } \|x\| \le k$$
• $w \in \mathbb{R}^d$: the parameter vector
• $L(w)$: the loss function
• Objective with L1 regularization:
$$f(w) = L(w) + \lambda\|w\|_1$$
• The problem to solve:
$$\min_{w \in \mathbb{R}^d} f(w) = \min_{w \in \mathbb{R}^d} L(w) + \lambda\|w\|_1$$
28. • Introduce an auxiliary variable $\eta \in \mathbb{R}^d$ with $\eta_j \ge 0$, one $\eta_j$ for each $w_j$.
• With $f(w) = L(w) + \lambda\|w\|_1$ and the problem $\min_{w \in \mathbb{R}^d} f(w)$, the L1 norm can be written as
$$\|w\|_1 = \sum_{j=1}^{d} |w_j| = \frac{1}{2}\sum_{j=1}^{d} \min_{\eta \in \mathbb{R}^d:\, \eta_j \ge 0}\left(\frac{w_j^2}{\eta_j} + \eta_j\right)$$
• This follows from the inequality of arithmetic and geometric means:
$$\frac{w_j^2}{\eta_j} + \eta_j \ge 2|w_j|,$$
with equality at $\eta_j = |w_j|$.
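The identity can be checked numerically for a single coordinate. This is a minimal sketch with a hypothetical helper `eta_bound` and an arbitrary grid of $\eta_j$ values chosen for illustration.

```python
def eta_bound(w_j, eta_j):
    """Value of (w_j^2 / eta_j + eta_j) / 2, an upper bound on |w_j| for eta_j > 0."""
    return 0.5 * (w_j * w_j / eta_j + eta_j)

w_j = -3.0
etas = [0.5, 1.0, 2.0, 3.0, 4.0, 6.0]           # candidate values of eta_j
values = [eta_bound(w_j, e) for e in etas]
best_eta = etas[values.index(min(values))]

# Every value is at least |w_j|; the smallest one is reached at eta_j = |w_j|.
print(best_eta, min(values))   # 3.0 3.0
```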
29. • Using the variational form of the L1 norm, the problem becomes a joint minimization over $w$ and $\eta$:
$$\min_{w \in \mathbb{R}^d} L(w) + \lambda\|w\|_1 = \min_{w,\eta \in \mathbb{R}^d,\, \eta_j \ge 0} L(w) + \frac{\lambda}{2}\sum_{j=1}^{d} \frac{w_j^2}{\eta_j} + \frac{\lambda}{2}\sum_{j=1}^{d} \eta_j \tag{14}$$
• Alternating minimization:
1. For $j = 1, \dots, d$, initialize $\eta_j^1 = 1$.
2. Repeat:
a. Update $w^t$:
$$w^t = \arg\min_{w \in \mathbb{R}^d}\left(L(w) + \frac{\lambda}{2}\sum_{j=1}^{d} \frac{w_j^2}{\eta_j^t}\right) \tag{15}$$
b. Update $\eta^{t+1}$:
$$\eta_j^{t+1} = |w_j^t|, \quad j = 1, \dots, d \tag{16}$$
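The alternating updates can be sketched for a loss where step (a) has a closed form. The example assumes $L(w) = \frac{1}{2}\|y - w\|_2^2$, a choice made here only for illustration and not taken from the slides, so that the $w$-update decouples across coordinates. The function name `irls_l1` and the fixed iteration count are also assumptions.

```python
def irls_l1(y, lam, iters=200):
    """Alternate updates (15) and (16), assuming L(w) = 1/2 * ||y - w||^2."""
    eta = [1.0] * len(y)                  # step 1: eta_j = 1
    w = list(y)
    for _ in range(iters):
        # (a) w_j minimizes 1/2 (y_j - w_j)^2 + lam/2 * w_j^2 / eta_j.
        #     Setting the derivative to zero gives w_j = y_j * eta_j / (eta_j + lam).
        w = [yj * ej / (ej + lam) for yj, ej in zip(y, eta)]
        # (b) eta_j = |w_j|
        eta = [abs(wj) for wj in w]
    return w

w = irls_l1([3.0, 0.5, -2.0], lam=1.0)
print(w)   # close to [2.0, 0.0, -1.0]
```

For this loss the iterates approach the soft-thresholded values of $y$. Coordinates with $|y_j| \le \lambda$ shrink toward zero geometrically rather than reaching it in a finite number of steps, which is a known property of this kind of reweighting scheme.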
30. • The two updates of the alternating scheme:
$$w^t = \arg\min_{w \in \mathbb{R}^d}\left(L(w) + \frac{\lambda}{2}\sum_{j=1}^{d} \frac{w_j^2}{\eta_j^t}\right)$$
$$\eta_j^{t+1} = |w_j^t|, \quad j = 1, \dots, d$$
• Proximal operator of a function $g$:
$$\operatorname{prox}_g(y) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - w\|_2^2 + g(w)$$
31. • Proximal operator of a function $g$:
$$\operatorname{prox}_g(y) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - w\|_2^2 + g(w)$$
• Proximal operator of the L1 norm:
$$\operatorname{prox}^{\ell_1}_{\lambda}(y) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1$$
• It is given coordinate-wise by soft thresholding:
$$\left[\operatorname{prox}^{\ell_1}_{\lambda}(y)\right]_j = \begin{cases} y_j + \lambda, & \text{if } y_j < -\lambda \\ 0, & \text{if } -\lambda \le y_j \le \lambda \\ y_j - \lambda, & \text{if } y_j > \lambda \end{cases} \qquad j = 1, \dots, d$$
Figure: the soft-thresholding function $ST(z)$ plotted against $z$, with thresholds at $-\lambda$ and $\lambda$.
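The soft-thresholding formula translates directly into code. The sketch below uses hypothetical names (`soft_threshold`, `prox_objective`); the second function is the per-coordinate objective and is included only so the result can be compared against other candidate values.

```python
def soft_threshold(y, lam):
    """Coordinate-wise proximal operator of lam * ||w||_1."""
    out = []
    for yj in y:
        if yj < -lam:
            out.append(yj + lam)
        elif yj > lam:
            out.append(yj - lam)
        else:
            out.append(0.0)
    return out

def prox_objective(wj, yj, lam):
    """Per-coordinate objective 1/2 (y_j - w_j)^2 + lam * |w_j|."""
    return 0.5 * (yj - wj) ** 2 + lam * abs(wj)

y = [-2.5, -0.3, 0.0, 0.8, 4.0]
print(soft_threshold(y, lam=1.0))   # [-1.5, 0.0, 0.0, 0.0, 3.0]
```

Entries with magnitude at most $\lambda$ are set exactly to zero; the rest move toward zero by $\lambda$.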
32. • Proximal operator of the L1 norm and its closed form:
$$\operatorname{prox}^{\ell_1}_{\lambda}(y) = \arg\min_{w \in \mathbb{R}^d} \frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1 \tag{18}$$
$$\left[\operatorname{prox}^{\ell_1}_{\lambda}(y)\right]_j = \begin{cases} y_j + \lambda, & \text{if } y_j < -\lambda \\ 0, & \text{if } -\lambda \le y_j \le \lambda \\ y_j - \lambda, & \text{if } y_j > \lambda \end{cases} \qquad j = 1, \dots, d$$
• The objective separates across coordinates:
$$\frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1 = \sum_{j=1}^{d}\left[\frac{1}{2}(y_j - w_j)^2 + \lambda|w_j|\right] \tag{19}$$
• Optimality condition for each coordinate:
$$y_j - w_j \in \lambda\,\partial|w_j|, \quad j = 1, \dots, d \tag{20}$$
• Subdifferential of the absolute value:
$$\partial|w| = \begin{cases} -1, & \text{if } w < 0 \\ [-1, 1], & \text{if } w = 0 \\ 1, & \text{if } w > 0 \end{cases}$$
Figure: graph of $|w|$ against $w$.
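The step from the optimality condition to the soft-thresholding formula is short enough to write out. The following case analysis is a standard argument, added here as a worked derivation rather than taken from the slides.

```latex
% Solve y_j - w_j \in \lambda \partial |w_j| by cases on the sign of w_j.
\begin{align*}
w_j > 0 &: \quad y_j - w_j = \lambda
        \;\Rightarrow\; w_j = y_j - \lambda,
        \quad \text{consistent only if } y_j > \lambda, \\
w_j < 0 &: \quad y_j - w_j = -\lambda
        \;\Rightarrow\; w_j = y_j + \lambda,
        \quad \text{consistent only if } y_j < -\lambda, \\
w_j = 0 &: \quad y_j \in [-\lambda, \lambda].
\end{align*}
% The three cases are mutually exclusive and cover all y_j,
% which gives the soft-thresholding formula.
```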
33. • Proximal gradient method: at each step, minimize a local model of $f$ around the current iterate $w^t$, starting from an initial point $w^0$ and using a step size $\eta_t$.
• The smooth part $L$ is replaced by its linearization at $w^t$ plus a quadratic term, while the L1 term is kept as is:
$$w^{t+1} = \arg\min_{w}\; \nabla L(w^t)^\top (w - w^t) + \lambda\|w\|_1 + \frac{1}{2\eta_t}\|w - w^t\|_2^2$$
• Completing the square shows that this is a gradient step followed by soft thresholding with threshold $\lambda\eta_t$:
$$w^{t+1} = \operatorname{prox}^{\ell_1}_{\lambda\eta_t}\!\left(w^t - \eta_t \nabla L(w^t)\right)$$
34. • Update rule of the proximal gradient method:
$$w^{t+1} = \arg\min_{w}\; \nabla L(w^t)^\top (w - w^t) + \lambda\|w\|_1 + \frac{1}{2\eta_t}\|w - w^t\|_2^2$$
$$w^{t+1} = \operatorname{prox}^{\ell_1}_{\lambda\eta_t}\!\left(w^t - \eta_t \nabla L(w^t)\right)$$
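A minimal sketch of this iteration for a least-squares loss follows. It rests on assumptions that are not in the slides: $L(w) = \frac{1}{2}\|Xw - y\|_2^2$, a constant step size $1/\|X\|_F^2$ (a conservative choice, since $\|X\|_F^2$ is an upper bound on the largest eigenvalue of $X^\top X$), plain Python lists instead of a linear-algebra library, and the hypothetical name `ista`.

```python
def soft_threshold(z, t):
    """Scalar soft thresholding with threshold t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def ista(X, y, lam, iters=500):
    """Proximal gradient for min_w 1/2 * ||X w - y||^2 + lam * ||w||_1."""
    n, d = len(X), len(X[0])
    # Constant step size eta = 1 / ||X||_F^2 (assumed, conservative).
    step = 1.0 / sum(X[i][j] ** 2 for i in range(n) for j in range(d))
    w = [0.0] * d                                  # initial point w^0
    for _ in range(iters):
        resid = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * resid[i] for i in range(n)) for j in range(d)]
        # Gradient step on L, then soft thresholding with threshold lam * eta.
        w = [soft_threshold(w[j] - step * grad[j], lam * step) for j in range(d)]
    return w

X = [[1.0, 0.5], [0.0, 1.0], [1.0, 1.0]]
y = [1.0, 0.2, 1.1]
print(ista(X, y, lam=0.1))
```

When `X` is the identity matrix, the problem reduces to the proximal operator itself, so the result should match soft thresholding of `y`.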
35. • Equality-constrained optimization problem:
$$\min_{x \in \mathbb{R}^n} f(x) \tag{23}$$
$$\text{s.t. } g_j(x) = 0, \quad j = 1, \dots, p \tag{24}$$
• Constraints collected into a vector:
$$g(x) = (g_1(x), \dots, g_p(x))^\top \tag{25}$$
• Augmented Lagrangian:
$$L_\rho(x, y) = f(x) + y^\top g(x) + \frac{\rho}{2}\|g(x)\|_2^2 \tag{26}$$
36. • Augmented Lagrangian and the associated saddle-point problem:
$$L_\rho(x, y) = f(x) + y^\top g(x) + \frac{\rho}{2}\|g(x)\|_2^2 \tag{26}$$
$$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^p} L_\rho(x, y) \tag{27}$$
• Let $x^*$ be an optimal solution and $y^*$ the corresponding Lagrange multiplier. If the gradients
$$\nabla g_1(x^*), \dots, \nabla g_p(x^*) \tag{28}$$
are linearly independent, then
$$\nabla f(x^*) + \sum_{j=1}^{p} y_j^* \nabla g_j(x^*) = 0 \tag{29}$$
$$g_j(x^*) = 0, \quad j = 1, \dots, p \tag{30}$$
37. • Conditions satisfied by an optimal solution $x^*$ and its multiplier $y^*$:
$$\nabla f(x^*) + \sum_{j=1}^{p} y_j^* \nabla g_j(x^*) = 0$$
$$g_j(x^*) = 0, \quad j = 1, \dots, p$$
• Gradient of the augmented Lagrangian with respect to $x$:
$$\nabla_x L_\rho(x, y) = \nabla f(x) + \sum_{j=1}^{p} y_j \nabla g_j(x) + \rho \sum_{j=1}^{p} g_j(x) \nabla g_j(x)$$
• At $x = x^*$ and $y = y^*$ the penalty term vanishes because $g_j(x^*) = 0$, so $x^*$ is a stationary point of $L_\rho(\cdot, y^*)$:
$$\nabla_x L_\rho(x, y^*)\big|_{x = x^*} = \nabla f(x^*) + \sum_{j=1}^{p} y_j^* \nabla g_j(x^*) = 0$$
39. • Method of multipliers (augmented Lagrangian method):
1. Choose an initial multiplier $y_0$.
2. Find $x_{k+1}$ such that $\|\nabla_x L_{\rho_k}(x_{k+1}, y_k)\| \le \epsilon_k$, where $\rho_k > 0$, $\epsilon_k \ge 0$, and $\epsilon_k \to 0$.
3. $y_{k+1} \leftarrow y_k + \rho_k\, g(x_{k+1})$
4. $k \leftarrow k + 1$
• Conjugate function of $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$:
$$f^*(s) = \sup\{\langle s, x\rangle - f(x) \mid x \in \mathbb{R}^n\}$$
with $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$. The mapping $f \to f^*$ is the conjugate (Legendre-Fenchel) transform.
Figure: graph of $f(x)$ in the $x$-$y$ plane, with a slope $p$ and the value $-f^*(p)$ marked.
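The four steps of the method of multipliers can be sketched on a toy problem. Everything specific below is an assumption made for illustration and does not come from the slides: the problem $\min x_1^2 + x_2^2$ subject to $x_1 + x_2 - 1 = 0$, a fixed $\rho_k = 1$, the tolerance schedule $\epsilon_k = 0.5^k$, plain gradient descent for step 2, and the function name.

```python
import math

def method_of_multipliers(rho=1.0, outer_iters=40):
    """Toy run: min x1^2 + x2^2  s.t.  g(x) = x1 + x2 - 1 = 0."""
    x = [0.0, 0.0]
    y = 0.0                                   # step 1: initial multiplier y_0
    step = 1.0 / (2.0 + 2.0 * rho)            # 1 / (largest Hessian eigenvalue of L_rho in x)
    for k in range(outer_iters):
        eps_k = 0.5 ** k                      # tolerance eps_k -> 0
        for _ in range(1000):                 # step 2: approximately minimize L_rho over x
            g = x[0] + x[1] - 1.0
            # grad_x L_rho = grad f + y * grad g + rho * g * grad g
            grad = [2.0 * x[i] + y + rho * g for i in range(2)]
            if math.hypot(grad[0], grad[1]) <= eps_k:
                break
            x = [x[i] - step * grad[i] for i in range(2)]
        y += rho * (x[0] + x[1] - 1.0)        # step 3: multiplier update
    return x, y                               # step 4 is the loop increment

x_opt, y_opt = method_of_multipliers()
print(x_opt, y_opt)   # close to [0.5, 0.5] and -1.0
```

For this problem the stationarity condition $\nabla f(x^*) + y^* \nabla g(x^*) = 0$ gives $x^* = (0.5, 0.5)$ and $y^* = -1$, which the iterates approach.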
40. • Fenchel duality for convex functions $f$ and $g$, with $X \in \mathbb{R}^{n \times d}$:
$$\min_{w \in \mathbb{R}^d} \big(f(Xw) + g(w)\big) = \max_{\alpha \in \mathbb{R}^n} \big(-f^*(-\alpha) - g^*(X^\top \alpha)\big) \tag{33}$$
• The optimal solutions $w^*$ and $\alpha^*$ satisfy
$$w^* \in \partial g^*(X^\top \alpha^*) \tag{34}$$
$$\alpha^* \in -\partial f(Xw^*) \tag{35}$$
41. • L1-regularized problem with loss $f_l$ and regularizer $\lambda\|\cdot\|_1$ (primal problem):
$$\min_{w \in \mathbb{R}^d} f_l(Xw) + \lambda\|w\|_1 \tag{37}$$
• Dual problem:
$$\max_{\alpha \in \mathbb{R}^n} -f_l^*(-\alpha) - \delta_{\|\cdot\|_\infty \le \lambda}(X^\top \alpha) \tag{38}$$
where $\delta_{\|\cdot\|_\infty \le \lambda}$ is the indicator function
$$\delta_{\|\cdot\|_\infty \le \lambda}(v) = \begin{cases} 0, & \text{if } \|v\|_\infty \le \lambda \\ +\infty, & \text{otherwise} \end{cases}$$
• Equivalent equality-constrained form with an auxiliary variable $v$:
$$\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) \tag{39}$$
$$\text{s.t. } X^\top \alpha = v \tag{40}$$
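The indicator function in the dual appears because it is the conjugate of the L1 regularizer. The derivation below is a standard computation, added as a worked step rather than taken from the slides; it writes $g(w) = \lambda\|w\|_1$.

```latex
% Conjugate of g(w) = \lambda \|w\|_1, computed coordinate by coordinate.
\begin{align*}
g^*(v) &= \sup_{w \in \mathbb{R}^d} \big( \langle v, w \rangle - \lambda \|w\|_1 \big)
        = \sum_{j=1}^{d} \sup_{w_j \in \mathbb{R}} \big( v_j w_j - \lambda |w_j| \big). \\
\intertext{If $|v_j| \le \lambda$, then $v_j w_j - \lambda |w_j| \le (|v_j| - \lambda)\,|w_j| \le 0$,
and the value $0$ is attained at $w_j = 0$. If $|v_j| > \lambda$, take
$w_j = t \,\mathrm{sign}(v_j)$ and let $t \to \infty$: the value $t\,(|v_j| - \lambda)$ is unbounded. Hence}
g^*(v) &= \begin{cases} 0, & \text{if } \|v\|_\infty \le \lambda \\ +\infty, & \text{otherwise} \end{cases}
        \;=\; \delta_{\|\cdot\|_\infty \le \lambda}(v).
\end{align*}
```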
42. • Dual problem in equality-constrained form:
$$\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) \tag{39}$$
$$\text{s.t. } X^\top \alpha = v \tag{40}$$
• Adding a quadratic penalty on the constraint violation:
$$\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) + \frac{\eta}{2}\|X^\top \alpha - v\|_2^2 \tag{41}$$
• Augmented Lagrangian with multiplier $w \in \mathbb{R}^d$:
$$L_\eta(\alpha, v, w) = f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) + w^\top (X^\top \alpha - v) + \frac{\eta}{2}\|X^\top \alpha - v\|_2^2 \tag{42}$$
• Saddle-point problem:
$$\max_{w \in \mathbb{R}^d} \min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} L_\eta(\alpha, v, w)$$
43. • Dual augmented Lagrangian method (Tomioka and Sugiyama, 2009)
• Augmented Lagrangian of the dual problem and its saddle-point form:
$$L_\eta(\alpha, v, w) = f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) + w^\top (X^\top \alpha - v) + \frac{\eta}{2}\|X^\top \alpha - v\|_2^2$$
$$\max_{w \in \mathbb{R}^d} \min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} L_\eta(\alpha, v, w)$$
• Minimize over $\alpha, v$ with $w^t$ fixed:
$$(\alpha^{t+1}, v^{t+1}) = \arg\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} L_\eta(\alpha, v, w^t)$$
• Update $w^t$:
$$w^{t+1} = w^t + \eta\left(X^\top \alpha^{t+1} - v^{t+1}\right)$$
• The minimization over $v$ reduces to
$$\min_{v \in \mathbb{R}^d} L_\eta(\alpha, v, w^t) = \min_{v \in \mathbb{R}^d} \frac{1}{2\eta}\|\eta v - \hat{w}^t\|_2^2 + \delta_{\|\cdot\|_\infty \le \lambda}(v) + \text{const.}$$
where
$$\hat{w}^t = w^t + \eta X^\top \alpha$$
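The minimization over $v$ is a projection onto the box $\|v\|_\infty \le \lambda$, and substituting it into the multiplier update gives soft thresholding of $\hat{w}^t$ with threshold $\lambda\eta$. The sketch below checks this numerically. It is a derived observation rather than a statement from the slides, the vector `w_hat` is taken as given (the $\alpha$-minimization is not shown), and all function names are hypothetical.

```python
def clip(z, lo, hi):
    """Project a scalar onto the interval [lo, hi]."""
    return max(lo, min(hi, z))

def soft_threshold(z, t):
    """Scalar soft thresholding with threshold t."""
    if z > t:
        return z - t
    if z < -t:
        return z + t
    return 0.0

def v_step(w_hat, eta, lam):
    """argmin_v 1/(2*eta) * ||eta*v - w_hat||^2  subject to  ||v||_inf <= lam."""
    return [clip(wj / eta, -lam, lam) for wj in w_hat]

def multiplier_update(w_hat, eta, lam):
    """w_hat - eta * v, i.e. w^t + eta * (X^T alpha - v) with w_hat = w^t + eta * X^T alpha."""
    v = v_step(w_hat, eta, lam)
    return [wj - eta * vj for wj, vj in zip(w_hat, v)]

w_hat = [3.0, -0.2, 0.4, -5.0]
eta, lam = 0.5, 1.0
print(multiplier_update(w_hat, eta, lam))
print([soft_threshold(wj, lam * eta) for wj in w_hat])
```

Both printed vectors should agree up to rounding, which is why the multiplier $w$ in this method plays the role of the sparse primal variable.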
84. 1. Kataoka, S., Yasuda, M., Furtlehner, C., and Tanaka, K.: Traffic data reconstruction based on Markov random field modeling, Inverse Problems, 30, 025003, 2014.
2. Friedman, J., Hastie, T., and Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, 3, pp. 432-441, 2008.
3. Mazumder, R., and Hastie, T.: The graphical lasso: New insights and alternatives, Electronic Journal of Statistics, 6, pp. 2125-2149, 2012.
4. Dempster, A. P., Laird, N. M., and Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1, pp. 1-38, 1977.
5. , , , :
, 12
ITS 2014 Peer-Review Proceedings, CD-ROM,
2014.
86. • Rish and Grabarnik, Sparse Modeling: Theory, Algorithms, and Applications, CRC Press, 2014.
• Eldar and Kutyniok, Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.
87. • Cover, T. M., Van Campenhout, J. M. (1977). On the possible orderings in the measurement selection
problem. IEEE Transactions on Systems, Man, and Cybernetics, 7(9), 657-661.
• Bengio, Y., Courville, A., Vincent, P. (2013). Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.
• Tomioka, R., Sugiyama, M. (2009). Dual-augmented Lagrangian method for efficient sparse
reconstruction. IEEE Signal Processing Letters, 16(12), 1067-1070.
• Eckstein, J., Bertsekas, D. P. (1992). On the Douglas-Rachford splitting method and the proximal
point algorithm for maximal monotone operators. Mathematical Programming, 55(1), 293-318.
• Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J. (2011). Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends® in Machine
Learning, 3(1), 1-122.
• Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of computational and
graphical statistics, 7(3), 397-416.
• Daubechies, I., Defrise, M., De Mol, C. (2004). An iterative thresholding algorithm for linear inverse
problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11), 1413-
1457.
• Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The
Annals of Applied Statistics, 1(2), 302-332.
• Wu, T. T., Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The
Annals of Applied Statistics, 224-244.
• Beck, A., Tetruashvili, L. (2013). On the convergence of block coordinate descent type methods.
SIAM journal on Optimization, 23(4), 2037-2060.