Sparse Modeling


• Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B (Methodological), 267-288. (Google Scholar: 21305)
• Donoho, D. L. (2006). Compressed sensing. IEEE Transactions on Information Theory, 52(4), 1289-1306. (19534)
• Olshausen, B. A., & Field, D. J. (1996). Emergence of simple-cell receptive field properties by learning a sparse code for natural images. Nature, 381(6583), 607. (4765)
• Candes, E. J., & Tao, T. (2005). Decoding by linear programming. IEEE Transactions on Information Theory, 51(12), 4203-4215. (5488)
• Candes, E., & Tao, T. (2007). The Dantzig selector: Statistical estimation when p is much larger than n. The Annals of Statistics, 2313-2351. (2603)
• Candès, E. J., Romberg, J., & Tao, T. (2006). Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information. IEEE Transactions on Information Theory, 52(2), 489-509. (12285)


OR
SP (e.g. Uber)
FD
Macroscopic Fundamental Diagram (MFD)
Built environment

• $O(e^N)$ (Cover and van Campenhout, 1977)

[Figure: a 5×5 matrix in which many entries are 0, and the vector (0,0,0,0,0)]

Bengio, Y., Courville, A., & Vincent, P. (2013). Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.
"Another important perspective on representation learning is based on the geometric notion of manifold. Its premise is the manifold hypothesis, according to which real-world data presented in high-dimensional spaces are expected to concentrate in the vicinity of a manifold $\mathcal{M}$ of much lower dimensionality $d_{\mathcal{M}}$, embedded in high-dimensional input space $\mathbb{R}^{d_x}$. This prior seems particularly well suited for AI tasks such as those involving images, sounds, or text, for which most uniformly sampled input configurations are unlike natural stimuli. As soon as there is a notion of “representation,” one can think of a manifold by considering the variations in input space which are captured by or reflected (by corresponding changes) in the learned representation. To first approximation, some directions are well preserved (the tangent directions of the manifold), while others are not (directions orthogonal to the manifolds). With this perspective, the primary unsupervised learning task is then seen as modeling the structure of the data-supporting manifold. The associated representation being learned can be associated with an intrinsic coordinate system on the embedded manifold. The archetypal manifold modeling algorithm is, not surprisingly, also the archetypal low-…"
Bengio et al. (2013)
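What is a manifold? (Intuitive explanation)
• A shape that looks different on the surface but can essentially be represented in d-dimensional Euclidean space
• It can also be described as "a shape on which a map can be drawn locally" (example: the surface of the Earth)
[Figures: a 1-dimensional manifold embedded in 3 dimensions; likewise, a 2-dimensional manifold (the "Swiss roll")]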

• x A y
• y = A x A
y x
•
•
A
y x

• Minimize $\|y - Ax\|^2$ with respect to x

$$\hat{x} = \arg\min_{x}\; \frac{1}{2}\|y - Ax\|^2 + \lambda\|x\| \quad (4)$$

$$\hat{x} = \arg\min_{x}\; \frac{1}{2}\|y - Ax\|^2 \quad (5)$$

$$\text{subject to } \|x\| \le k \quad (6)$$

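$$\|x\|_1 = \sum_{i=1}^{n} |x_i|$$

$$\|x\|_2^2 = \sum_{i=1}^{n} x_i^2$$

$$\|x\|_\infty = \max_i |x_i|$$

A weighted combination of the first two, $\|x\|_1$ plus a multiple of $\|x\|_2^2$, is also used.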

• $L_0$ norm of β

$$\min_{\beta}\; 2\sum_{i=1}^{n} l(y_i, x_i^\top \beta) + 2\|\beta\|_0 \quad (1)$$
: 2017 9 21
$$\frac{1}{n}\, E_{\{x_i, y_i\}_{i=1}^{n}}\left[ 2\sum_{i=1}^{n} l(y_i, x_i^\top \hat\beta) + 2\|\hat\beta\|_0 \right] \quad (2)$$

$$= E_{\{x_i, y_i\}_{i=1}^{n}}\left[ E_{X,Y}\left[ 2\, l(Y, X^\top \hat\beta) \right] \right] + O\!\left(\frac{1}{n^2}\right) \quad (3)$$

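• In Bayesian estimation, the posterior probability is proportional to (probability of the observed data) × (prior probability).
• We want the parameter η that maximizes the posterior probability.
• Taking logarithms gives the following interpretation:

$$\hat{\eta} = \arg\max_{\eta}\; P(X \mid \eta)\, P(\eta \mid \lambda) = \arg\max_{\eta}\; \left[ \log P(X \mid \eta) + \log P(\eta \mid \lambda) \right]$$

The parameter conditioning the prior (written λ here) is the hyperparameter of the prior distribution. The first term plays the role of the loss function and the second the role of the regularization term.

Example: the prior is a normal distribution and the observation model is a normal distribution

Let $y_i = \phi(x_i, w) + \varepsilon_i$ with $\varepsilon_i \sim N(0, 1)$, so that

$$\log p(y_i \mid x_i, w) = \log N(y_i \mid \phi(x_i, w), 1) = -\frac{1}{2}\left(y_i - \phi(x_i, w)\right)^2 + \text{const.}$$

Taking the prior on the weights to be $p(w) = N(w \mid 0, \lambda^{-1} I)$ and treating it in the same way,

$$\hat{w} = \arg\min_{w} \left[ -\sum_i \log p(y_i \mid x_i, w) - \log p(w) \right] = \arg\min_{w}\; \frac{1}{2}\sum_i \left(y_i - \phi(x_i, w)\right)^2 + \frac{\lambda}{2}\, w^\top w$$

This is a regularization term given by the L2 norm. $\lambda^{-1}$ can also be viewed as the variance of the prior on w.

Example: the prior is a Laplace distribution and the observation model is a normal distribution

With the same observation model, take the prior to be a Laplace distribution with expected value 0:

$$p(w) = \mathrm{Laplace}\!\left(w \mid 0, \tfrac{2}{\lambda}\right) = \frac{\lambda}{4} \exp\!\left(-\frac{\lambda |w|}{2}\right)$$

Treating it in the same way,

$$\hat{w} = \arg\min_{w} \left[ -\sum_i \log p(y_i \mid x_i, w) - \log p(w) \right] = \arg\min_{w}\; \frac{1}{2}\sum_i \left(y_i - \phi(x_i, w)\right)^2 + \frac{\lambda}{2}\, \|w\|_1$$

This is a regularization term given by the L1 norm.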

• $w \in \mathbb{R}^d$: parameter vector
• $L(w)$: loss function

$$f(w) = L(w) + \lambda\|w\|_1 \quad (9)$$

$$\min_{w \in \mathbb{R}^d} f(w) = \min_{w \in \mathbb{R}^d}\; L(w) + \lambda\|w\|_1 \quad (10)$$

• Introduce an auxiliary variable $\eta \in \mathbb{R}^d$ with $\eta_j \ge 0$, one $\eta_j$ for each $w_j$:

$$\|w\|_1 = \sum_{j=1}^{d} |w_j| = \frac{1}{2} \sum_{j=1}^{d} \min_{\eta_j \ge 0} \left( \frac{w_j^2}{\eta_j} + \eta_j \right)$$

• By the inequality of arithmetic and geometric means,

$$\frac{w_j^2}{\eta_j} + \eta_j \ge 2|w_j|,$$

with equality when $\eta_j = |w_j|$.

$$\min_{w \in \mathbb{R}^d} L(w) + \lambda\|w\|_1 = \min_{w, \eta \in \mathbb{R}^d,\ \eta_j \ge 0}\; L(w) + \frac{\lambda}{2} \sum_{j=1}^{d} \frac{w_j^2}{\eta_j} + \frac{\lambda}{2} \sum_{j=1}^{d} \eta_j \quad (14)$$

1. For $j = 1, \dots, d$, initialize $\eta_j^1 = 1$.
2. Repeat until convergence:
   a. Update $w^t$:

$$w^t = \arg\min_{w \in \mathbb{R}^d} \left( L(w) + \frac{\lambda}{2} \sum_{j=1}^{d} \frac{w_j^2}{\eta_j^t} \right) \quad (15)$$

   b. Update $\eta^{t+1}$:

$$\eta_j^{t+1} = |w_j^t|, \quad j = 1, \dots, d \quad (16)$$
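The alternating updates (15) and (16) can be sketched in a few lines for one special case. The sketch below assumes the loss $L(w) = \frac{1}{2}\|y - w\|_2^2$ (an illustrative choice, not stated on the slides), for which step (15) has the closed form $w_j = y_j \eta_j / (\eta_j + \lambda)$. The function name `irls_l1_denoise` is hypothetical.

```python
def irls_l1_denoise(y, lam, n_iter=200):
    """Iteratively reweighted updates (15)-(16) for L(w) = 0.5 * ||y - w||^2.

    Assumes lam > 0. The closed form below solves step (15) for this
    particular loss only; other losses need their own w-update.
    """
    eta = [1.0] * len(y)  # step 1: eta_j^1 = 1
    w = list(y)
    for _ in range(n_iter):
        # Step 2a: minimize 0.5*(y_j - w_j)^2 + (lam/2) * w_j^2 / eta_j,
        # written so that eta_j = 0 gives w_j = 0 without dividing by zero.
        w = [yj * ej / (ej + lam) for yj, ej in zip(y, eta)]
        # Step 2b: eta_j^{t+1} = |w_j^t|
        eta = [abs(wj) for wj in w]
    return w


if __name__ == "__main__":
    print(irls_l1_denoise([3.0, 0.5, -2.0], lam=1.0))
```

For this loss the iterates approach the soft-thresholded values of y: entries with $|y_j| \le \lambda$ shrink toward 0 and the others toward $y_j \mp \lambda$.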

• Proximal operator of a function g:

$$\mathrm{prox}_g(y) = \arg\min_{w \in \mathbb{R}^d} \left( \frac{1}{2}\|y - w\|_2^2 + g(w) \right)$$

• For $g(w) = \lambda\|w\|_1$:

$$\mathrm{prox}^{\ell_1}_{\lambda}(y) = \arg\min_{w \in \mathbb{R}^d} \left( \frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1 \right) \quad (18)$$

$$\left( \mathrm{prox}^{\ell_1}_{\lambda}(y) \right)_j = \begin{cases} y_j + \lambda, & \text{if } y_j < -\lambda \\ 0, & \text{if } -\lambda \le y_j \le \lambda \\ y_j - \lambda, & \text{if } y_j > \lambda \end{cases} \qquad j = 1, \dots, d$$

[Figure: graph of the soft-thresholding function ST(z), which is 0 on the interval from −λ to λ]
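The componentwise formula for the $\ell_1$ proximal operator translates directly into code. The following is a minimal sketch on plain Python lists; the function name `soft_threshold` is an illustrative choice.

```python
def soft_threshold(y, lam):
    """Apply the l1 proximal operator (soft-thresholding) to each entry of y."""
    if lam < 0:
        raise ValueError("lam must be non-negative")
    out = []
    for yj in y:
        if yj > lam:
            out.append(yj - lam)   # shrink positive entries toward zero
        elif yj < -lam:
            out.append(yj + lam)   # shrink negative entries toward zero
        else:
            out.append(0.0)        # entries in [-lam, lam] become exactly zero
    return out


if __name__ == "__main__":
    print(soft_threshold([3.0, 0.5, -2.0, -1.0], 1.0))
```

Entries whose magnitude does not exceed λ are set exactly to zero, which is the source of sparsity in the methods that follow.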

• The objective of the $\ell_1$ proximal operator separates over the coordinates j:

$$\frac{1}{2}\|y - w\|_2^2 + \lambda\|w\|_1 = \sum_{j=1}^{d} \left( \frac{1}{2}(y_j - w_j)^2 + \lambda|w_j| \right) \quad (19)$$

• Optimality condition for each coordinate:

$$y_j - w_j \in \lambda\, \partial|w_j|, \quad j = 1, \dots, d \quad (20)$$

• Subdifferential of the absolute value:

$$\partial|w| = \begin{cases} -1, & \text{if } w < 0 \\ [-1, 1], & \text{if } w = 0 \\ 1, & \text{if } w > 0 \end{cases}$$

[Figure: graph of |w| against w]
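As a worked step (a sketch of the case analysis, not taken from the slides), substituting the three cases of $\partial|w_j|$ into the optimality condition $y_j - w_j \in \lambda\,\partial|w_j|$ recovers the soft-thresholding formula:

```latex
\begin{aligned}
w_j > 0 &: \quad y_j - w_j = \lambda \;\Rightarrow\; w_j = y_j - \lambda
  \quad (\text{consistent only if } y_j > \lambda), \\
w_j < 0 &: \quad y_j - w_j = -\lambda \;\Rightarrow\; w_j = y_j + \lambda
  \quad (\text{consistent only if } y_j < -\lambda), \\
w_j = 0 &: \quad y_j \in [-\lambda, \lambda].
\end{aligned}
```

The three cases are mutually exclusive and cover every value of $y_j$, so each coordinate has exactly one solution.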

• $w^0$: initial point
• $\eta_t$: step size
• Linearize L around $w^t$ and add a proximity term:

$$w^{t+1} = \arg\min_{w} \left( \nabla L(w^t)^\top (w - w^t) + \lambda\|w\|_1 + \frac{1}{2\eta_t}\|w - w^t\|_2^2 \right) \quad (21)$$

$$w^{t+1} = \mathrm{prox}^{\ell_1}_{\lambda\eta_t}\!\left( w^t - \eta_t \nabla L(w^t) \right) \quad (22)$$
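The proximal gradient update (22) can be sketched for the squared loss. The sketch assumes $L(w) = \frac{1}{2}\|Xw - y\|_2^2$ and a constant step size supplied by the caller (it should be at most the reciprocal of the largest eigenvalue of $X^\top X$). The names `ista` and `soft_threshold_vec` are illustrative, and plain lists stand in for a matrix library.

```python
def soft_threshold_vec(v, thresh):
    """Soft-threshold every entry of v at level thresh."""
    return [x - thresh if x > thresh else x + thresh if x < -thresh else 0.0
            for x in v]


def ista(X, y, lam, step, n_iter=500):
    """Proximal gradient iterations (22) for 0.5*||Xw - y||^2 + lam*||w||_1.

    X is a list of rows; step is a constant step size eta_t (an assumption
    of this sketch; the slides allow eta_t to vary with t).
    """
    n, d = len(X), len(X[0])
    w = [0.0] * d  # initial point w^0
    for _ in range(n_iter):
        # Residual Xw - y and gradient X^T (Xw - y) of the squared loss.
        r = [sum(X[i][j] * w[j] for j in range(d)) - y[i] for i in range(n)]
        grad = [sum(X[i][j] * r[i] for i in range(n)) for j in range(d)]
        # Gradient step followed by the l1 proximal operator at level lam*step.
        w = soft_threshold_vec([w[j] - step * grad[j] for j in range(d)],
                               lam * step)
    return w


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]
    print(ista(X, y, lam=0.5, step=1.0 / 3.0))
```

In the usage example the largest eigenvalue of $X^\top X$ is 3, so the step size 1/3 is used.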

$$\min_{x \in \mathbb{R}^n} f(x) \quad (23)$$

$$\text{s.t. } g_j(x) = 0, \quad j = 1, \dots, p \quad (24)$$

$$g(x) = (g_1(x), \dots, g_p(x))^\top \quad (25)$$

$$L_\rho(x, y) = f(x) + y^\top g(x) + \frac{\rho}{2}\|g(x)\|_2^2 \quad (26)$$

$$\min_{x \in \mathbb{R}^n} \max_{y \in \mathbb{R}^p} L_\rho(x, y) \quad (27)$$

• Let $x^*$ be a local solution at which the gradients

$$\nabla g_1(x^*), \dots, \nabla g_p(x^*) \quad (28)$$

are linearly independent. Then there exists a multiplier vector $y^*$ such that

$$\nabla f(x^*) + \sum_{j=1}^{p} y_j^* \nabla g_j(x^*) = 0 \quad (29)$$

$$g_j(x^*) = 0, \quad j = 1, \dots, p \quad (30)$$

• Gradient of the augmented Lagrangian with respect to x:

$$\nabla_x L_\rho(x, y) = \nabla f(x) + \sum_{j=1}^{p} y_j \nabla g_j(x) + \rho \sum_{j=1}^{p} g_j(x) \nabla g_j(x)$$

• At $x = x^*$ and $y = y^*$ the last sum vanishes because $g_j(x^*) = 0$, so

$$\nabla_x L_\rho(x, y^*)\big|_{x = x^*} = \nabla f(x^*) + \sum_{j=1}^{p} y_j^* \nabla g_j(x^*) = 0$$

1. Initialize $y^0$.
2. Find $x^{k+1}$ such that $\|\nabla_x L_{\rho_k}(x^{k+1}, y^k)\| \le \epsilon_k$, where $\rho_k > 0$, $\epsilon_k \ge 0$, and $\epsilon_k \to 0$.
3. $y^{k+1} \leftarrow y^k + \rho_k\, g(x^{k+1})$
4. $k \leftarrow k + 1$ and return to step 2.
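The four steps above can be sketched on a toy problem. The example assumes $f(x) = \frac{1}{2}\|x\|^2$ with a single linear constraint $g(x) = a^\top x - b$, a fixed $\rho_k = \rho$, and plain gradient descent as the inner solver. All of these are illustrative choices, and the function name `method_of_multipliers` is hypothetical.

```python
import math


def method_of_multipliers(a, b, rho=1.0, n_outer=40):
    """Minimize 0.5*||x||^2 subject to a.x - b = 0 (toy problem with p = 1)."""
    x = [0.0] * len(a)
    y = 0.0  # step 1: initial multiplier y^0
    # Step size 1/L, where L = 1 + rho*||a||^2 bounds the Hessian of L_rho in x.
    step = 1.0 / (1.0 + rho * sum(ai * ai for ai in a))
    for k in range(n_outer):
        # Step 2: inexact inner minimization with a shrinking tolerance eps_k
        # (floored at 1e-10 to stay above floating-point noise).
        eps_k = max(10.0 ** (-(k + 1)), 1e-10)
        for _ in range(100000):
            g = sum(ai * xi for ai, xi in zip(a, x)) - b
            # grad_x L_rho = grad f + y * grad g + rho * g * grad g
            grad = [xi + (y + rho * g) * ai for xi, ai in zip(x, a)]
            if math.sqrt(sum(v * v for v in grad)) <= eps_k:
                break
            x = [xi - step * v for xi, v in zip(x, grad)]
        # Step 3: multiplier update y^{k+1} = y^k + rho * g(x^{k+1}).
        g = sum(ai * xi for ai, xi in zip(a, x)) - b
        y = y + rho * g
    return x, y


if __name__ == "__main__":
    print(method_of_multipliers([1.0, 2.0], 5.0))
```

For this problem the exact solution is $x^* = b\,a / \|a\|^2$ with multiplier $y^* = -b / \|a\|^2$, which gives a simple check on the iterates.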

• For a function $f : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$, the conjugate function is

$$f^*(s) = \sup\{ \langle s, x \rangle - f(x) \mid x \in \mathbb{R}^n \}$$

• $f^* : \mathbb{R}^n \to \mathbb{R} \cup \{+\infty\}$
• The transform $f \mapsto f^*$

[Figure: graph of f(x) with a line of slope p; the line meets the vertical axis at $-f^*(p)$]
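As a worked example of the definition (a sketch added for illustration), the conjugate of the $\ell_1$ penalty $g(w) = \lambda\|w\|_1$ is the indicator function of an $\ell_\infty$ ball:

```latex
\begin{aligned}
g^*(v) &= \sup_{w \in \mathbb{R}^d} \left\{ \langle v, w \rangle - \lambda\|w\|_1 \right\}
        = \sum_{j=1}^{d} \sup_{w_j \in \mathbb{R}} \left\{ v_j w_j - \lambda |w_j| \right\} \\
       &= \begin{cases} 0, & \text{if } \|v\|_\infty \le \lambda \\ +\infty, & \text{otherwise.} \end{cases}
\end{aligned}
```

If $|v_j| \le \lambda$, then $v_j w_j - \lambda|w_j| \le 0$ with the maximum 0 attained at $w_j = 0$. If $|v_j| > \lambda$, taking $w_j = t\,\mathrm{sign}(v_j)$ with $t \to \infty$ makes the expression unbounded.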

• Let f and g be convex functions and $X \in \mathbb{R}^{n \times d}$. Then

$$\min_{w \in \mathbb{R}^d} \left( f(Xw) + g(w) \right) = \max_{\alpha \in \mathbb{R}^n} \left( -f^*(-\alpha) - g^*(X^\top \alpha) \right) \quad (33)$$

• The optimal solutions $w^*$, $\alpha^*$ satisfy

$$w^* \in \partial g^*(X^\top \alpha^*) \quad (34)$$

$$\alpha^* \in -\partial f(Xw^*) \quad (35)$$

• With loss function $f_l$ and regularizer $\lambda\|\cdot\|_1$:

$$\min_{w \in \mathbb{R}^d}\; f_l(Xw) + \lambda\|w\|_1 \quad (37)$$

$$\max_{\alpha \in \mathbb{R}^n}\; -f_l^*(-\alpha) - \delta_{\|\cdot\|_\infty \le \lambda}(X^\top \alpha) \quad (38)$$

$$\delta_{\|\cdot\|_\infty \le \lambda}(v) = \begin{cases} 0, & \text{if } \|v\|_\infty \le \lambda \\ +\infty, & \text{otherwise} \end{cases}$$

• Equivalent constrained form:

$$\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d}\; f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) \quad (39)$$

$$\text{s.t. } X^\top \alpha = v \quad (40)$$

$$\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d}\; f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) + \frac{\eta}{2}\|X^\top \alpha - v\|_2^2 \quad (41)$$

$$L_\eta(\alpha, v, w) = f_l^*(-\alpha) + \delta_{\|\cdot\|_\infty \le \lambda}(v) + w^\top (X^\top \alpha - v) + \frac{\eta}{2}\|X^\top \alpha - v\|_2^2 \quad (42)$$

$$\max_{w \in \mathbb{R}^d} \min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} L_\eta(\alpha, v, w)$$

• Dual augmented Lagrangian method (Tomioka and Sugiyama, 2009)
• Minimize over α and v jointly:

$$(\alpha^{t+1}, v^{t+1}) = \arg\min_{\alpha \in \mathbb{R}^n,\, v \in \mathbb{R}^d} L_\eta(\alpha, v, w^t) \quad (44)$$

• Update $w^t$:

$$w^{t+1} = w^t + \eta \left( X^\top \alpha^{t+1} - v^{t+1} \right) \quad (45)$$

• Minimization over v for fixed α:

$$\min_{v \in \mathbb{R}^d} L_\eta(\alpha, v, w^t) = \min_{v \in \mathbb{R}^d}\; \frac{1}{2\eta}\|\eta v - \hat{w}^t\|_2^2 + \delta_{\|\cdot\|_\infty \le \lambda}(v) + \text{const.} \quad (46)$$

$$\hat{w}^t = w^t + \eta X^\top \alpha$$

• The minimization over v can be carried out in closed form:

$$\hat{w}^t - \eta v^{t+1} = \mathrm{prox}^{\ell_1}_{\lambda\eta}(\hat{w}^t) \quad (47)$$

• Eliminating v leaves a minimization over α alone:

$$\varphi_t(\alpha) = f_l^*(-\alpha) + \frac{1}{2\eta_t} \left\| \mathrm{prox}^{\ell_1}_{\lambda\eta_t}\!\left( w^t + \eta_t X^\top \alpha \right) \right\|_2^2 \quad (48)$$

1. Initialize $w^0$.
2. Minimize $\varphi_t(\alpha)$ to obtain $\alpha^{t+1}$.
3. Update $w^t$:

$$w^{t+1} = \mathrm{prox}^{\ell_1}_{\lambda\eta_t}\!\left( w^t + \eta_t X^\top \alpha^{t+1} \right) \quad (49)$$

• Alternating direction method of multipliers: update α, v, and w in turn (Eckstein and Bertsekas, 1992; Boyd et al., 2011)
• Instead of minimizing $L_\eta(\alpha, v, w)$ over α and v jointly, minimize over each in turn:

$$\alpha^{t+1} = \arg\min_{\alpha \in \mathbb{R}^n} L_\eta(\alpha, v^t, w^t) \quad (50)$$

$$v^{t+1} = \arg\min_{v \in \mathbb{R}^d} L_\eta(\alpha^{t+1}, v, w^t) \quad (51)$$

$$w^{t+1} = w^t + \eta \left( X^\top \alpha^{t+1} - v^{t+1} \right) \quad (52)$$

https://en.wikipedia.org/wiki/Coordinate_descent

• Lasso problem (here i indexes the d observations and j the n coefficients):

$$\hat{\beta}^{\text{lasso}} = \arg\min_{\beta} \left\{ \frac{1}{2} \sum_{i=1}^{d} \Big( y_i - \beta_0 - \sum_{j=1}^{n} x_{ij} \beta_j \Big)^2 + \lambda \sum_{j=1}^{n} |\beta_j| \right\} \quad (53)$$

• For a given λ, optimize a single coefficient $\beta_j$ with the other coefficients fixed at their current values $\tilde{\beta}_k(\lambda)$, $k \ne j$:

$$R(\tilde{\beta}(\lambda), \beta_j) = \frac{1}{2} \sum_{i=1}^{d} \Big( y_i - \sum_{k \ne j} x_{ik} \tilde{\beta}_k(\lambda) - x_{ij} \beta_j \Big)^2 + \lambda \sum_{k \ne j} |\tilde{\beta}_k(\lambda)| + \lambda |\beta_j| \quad (54)$$

• Partial residual:

$$y_i - \tilde{y}_i^{(j)} = y_i - \sum_{k \ne j} x_{ik} \tilde{\beta}_k(\lambda) \quad (55)$$

• Update, for columns standardized so that $\sum_i x_{ij}^2 = 1$:

$$\tilde{\beta}_j(\lambda) \leftarrow \mathrm{TH}\Big( \sum_{i=1}^{d} x_{ij} \big( y_i - \tilde{y}_i^{(j)} \big),\ \lambda \Big) \quad (56)$$

where TH(·, λ) is the soft-thresholding operator.
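The coordinate-wise update (56) can be sketched as a short loop. The sketch below drops the intercept $\beta_0$ (it assumes centered data) and divides by $\sum_i x_{ij}^2$ so that it also works when the columns are not standardized; with unit-norm columns it reduces to (56). The function names are illustrative.

```python
def soft_threshold(z, lam):
    """Scalar soft-thresholding operator TH(z, lam)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_coordinate_descent(X, y, lam, n_sweeps=100):
    """Cyclic coordinate descent for 0.5*||y - X beta||^2 + lam*||beta||_1."""
    n_obs, n_coef = len(X), len(X[0])
    beta = [0.0] * n_coef
    residual = list(y)  # y - X beta, kept up to date after every update
    col_sq = [sum(X[i][j] ** 2 for i in range(n_obs)) for j in range(n_coef)]
    for _ in range(n_sweeps):
        for j in range(n_coef):
            if col_sq[j] == 0.0:
                continue  # an all-zero column carries no information
            # Inner product of column j with the partial residual (55).
            rho = sum(X[i][j] * (residual[i] + X[i][j] * beta[j])
                      for i in range(n_obs))
            new_beta = soft_threshold(rho, lam) / col_sq[j]
            delta = new_beta - beta[j]
            if delta != 0.0:
                for i in range(n_obs):
                    residual[i] -= X[i][j] * delta
                beta[j] = new_beta
    return beta


if __name__ == "__main__":
    X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    y = [1.0, 2.0, 3.0]
    print(lasso_coordinate_descent(X, y, lam=0.5))
```

Keeping the residual vector up to date makes each coordinate update cost one pass over a single column rather than over the whole matrix.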

• Fu (1998), Daubechies et al. (2004), Friedman et al. (2007), Wu and Lange (2008)
• (Friedman et al., 2010)
• (Beck and Tetruashvili, 2013)

– µ, S, $S^{-1} = Q$
[Figure: variables x1, x2, x3, x4, x5]
[Figure: e.g. the distributions P(x1) and P(x2) of the variables x1 and x2]

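(Gaussian Markov Random Field)
• Parameters: µ, Q (the precision matrix, written Θ in the equations below)
• $x = (x_1, \dots, x_L)$ is split into an observed part $x_o$ and an unobserved part $x_u$

$$p(x_u \mid x_o, \mu, \Theta) \propto \int N(x \mid \mu, \Theta^{-1}) \prod_{i \in O} \delta(y_i - x_i)\, dx_o$$

so that

$$p(x_u \mid x_o, \mu, \Theta) = N\!\left( x_u \mid \mu_u - \Theta_{uu}^{-1} \Theta_{uo} (x_o - \mu_o),\ \Theta_{uu}^{-1} \right)$$

$$\Theta = \begin{pmatrix} \Theta_{uu} & \Theta_{uo} \\ \Theta_{ou} & \Theta_{oo} \end{pmatrix}, \qquad \mu = \begin{pmatrix} \mu_u \\ \mu_o \end{pmatrix}$$

Here δ(⋅) is the Dirac delta function. The estimate of the unobserved part is

$$\hat{x}_u = \arg\max_{x_u} N\!\left( x_u \mid \mu_u - \Theta_{uu}^{-1} \Theta_{uo} (x_o - \mu_o),\ \Theta_{uu}^{-1} \right) \quad (1)$$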
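Because the conditional distribution is Gaussian, the maximizer in (1) is its mean $\mu_u - \Theta_{uu}^{-1} \Theta_{uo} (x_o - \mu_o)$. The sketch below computes it with a small Gaussian-elimination helper on plain lists; the helper `solve` and the function name `conditional_mean` are illustrative, and a linear algebra library would normally replace them.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [list(row) + [bi] for row, bi in zip(A, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(M[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (M[r][n] - s) / M[r][r]
    return x


def conditional_mean(theta, mu, obs_idx, x_obs):
    """Mean of x_u given x_o for a Gaussian with precision matrix theta."""
    unobs_idx = [i for i in range(len(mu)) if i not in obs_idx]
    # Right-hand side Theta_uo (x_o - mu_o).
    rhs = [sum(theta[u][o] * (xo - mu[o]) for o, xo in zip(obs_idx, x_obs))
           for u in unobs_idx]
    theta_uu = [[theta[u][v] for v in unobs_idx] for u in unobs_idx]
    correction = solve(theta_uu, rhs)  # Theta_uu^{-1} Theta_uo (x_o - mu_o)
    return [mu[u] - c for u, c in zip(unobs_idx, correction)]


if __name__ == "__main__":
    theta = [[2.0, -1.0, 0.0], [-1.0, 2.0, -1.0], [0.0, -1.0, 2.0]]
    print(conditional_mean(theta, [0.0, 0.0, 0.0], obs_idx=[1], x_obs=[1.0]))
```

In the usage example the middle variable of a three-node chain is observed, and both neighbors are pulled halfway toward the observed value.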

– Gaussian Graphical Model (GGM) ( , 2014; Kataoka et al., 2014)
– (Graphical Lasso; GL)
– Graphical Lasso (Friedman et al., 2008)

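• x: the variables on the graph $G(V, E)$, with parameters Θ and µ
– $V + 2$ ($V + V^2/2$)

Partition function of the GGM (4.11):

$$Z(\beta, \gamma, \alpha) = \exp\!\left( \frac{1}{2} \beta^T \Theta^{-1} \beta \right) \int_{-\infty}^{\infty} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Theta (x - \mu) \right) dx \quad (4.16)$$

$$Z(\beta, \gamma, \alpha) = \exp\!\left( \frac{1}{2} \beta^T \Theta^{-1} \beta \right) \sqrt{(2\pi)^N \det(\Theta^{-1})} \quad (4.17)$$

From (4.16) and (4.17):

$$p(x \mid \beta, \gamma, \alpha) = \frac{1}{\sqrt{(2\pi)^N \det(\Theta^{-1})}} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Theta (x - \mu) \right) \quad (4.18)$$

The GGM of Kataoka et al. on the graph $G(V, E)$:

$$p(x \mid \beta, \eta) \propto \exp\!\left( \beta^T x - \frac{\eta\epsilon}{2} \sum_{i \in V} x_i^2 - \frac{\eta}{2} \sum_{(i,j) \in E} (x_i - x_j)^2 \right) \quad (4.19)$$

Equation (4.19) is (4.11) with $\gamma_i = \eta\epsilon$ and $\alpha = \eta$. Writing $\partial(i)$ for the set of neighbors of node i, the exponent of (4.19) becomes

$$\exp\!\left( \beta^T x - \frac{\eta}{2} \sum_{i \in V} \left( \epsilon + |\partial(i)| \right) x_i^2 + \eta \sum_{(i,j) \in E} x_i x_j \right) = \exp\!\left( -\frac{\eta}{2} (x - \mu)^T \Theta (x - \mu) + \frac{1}{2\eta} \beta^T \Theta^{-1} \beta \right) \quad (4.20)$$

so that

$$p(x \mid \beta, \eta) = \sqrt{\frac{\eta^N \det \Theta}{(2\pi)^N}} \exp\!\left( -\frac{\eta}{2} (x - \mu)^T \Theta (x - \mu) \right)$$

with

$$\Theta_{ij} = \begin{cases} \epsilon + |\partial(i)|, & i = j \\ -1, & (i, j) \in E \text{ or } (j, i) \in E \\ 0, & \text{otherwise} \end{cases} \qquad \mu \equiv \frac{1}{\eta} \Theta^{-1} \beta$$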

$$p(x \mid \mu, \Theta) := \frac{1}{Z(\Theta)} \exp\!\left( -\frac{1}{2} (x - \mu)^T \Theta (x - \mu) \right) \quad (4.1)$$

Σ is the covariance matrix and Θ the precision matrix. The log-likelihood is

$$\ln p(\Theta, \mu) = \frac{D}{2} \log\det\Theta - \frac{1}{2} \sum_{d} (x^d - \mu)^T \Theta (x^d - \mu) + \text{const} \quad (4.2)$$

where D is the number of data and $x^d = (x_1^d, x_2^d, \dots, x_{|V|}^d)$ is the d-th datum.

$$\Theta^* = \arg\max_{\Theta}\; \log\det\Theta - \mathrm{tr}(S\Theta) - \rho\|\Theta\|_1 \quad (4.31)$$

$$\|\Theta\|_1 = \sum_{i=1}^{|V|} \sum_{j=1}^{|V|} |\Theta_{ij}|$$

Here S is the sample covariance matrix. The $L_1$ term in (4.31) drives entries to $\Theta_{ij} = 0$.

• Friedman et al. (2008): Graphical Lasso
– $L_1$
• GL

• Q: the precision matrix (written Θ in the equations below)
• S, G (G is written Γ in the equations below)
• W: the estimate of $Q^{-1}$

The subgradient of the $L_1$-regularized objective (4.31) with respect to Θ is

$$\frac{\partial}{\partial\Theta} \ln p(\Theta) = \Theta^{-1} - S - \rho\,\Gamma \quad (4.32)$$

$$\Gamma_{ij} \begin{cases} = \mathrm{sign}(\Theta_{ij}), & \text{if } \Theta_{ij} \ne 0 \\ \in [-1, 1], & \text{if } \Theta_{ij} = 0 \end{cases}$$

Setting (4.32) to zero gives the optimality condition of (4.31):

$$\Theta^{-1} - S - \rho\,\Gamma = 0 \quad (4.33)$$

Let W denote the estimate of $\Sigma = \Theta^{-1}$, and partition Θ, W, and S as

$$\Theta = \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix}, \quad W = \begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix} \quad (4.34)$$

where $\theta_{12}, w_{12}, s_{12}$ are vectors and $\theta_{22}, w_{22}, s_{22}$ are scalars. GL solves (4.33) block by block.

• Upper-right block of (4.33) under the partition (4.34):

$$w_{12} - s_{12} - \rho\,\gamma_{12} = 0 \quad (4.36)$$

• From $W\Theta = I$:

$$\begin{pmatrix} W_{11} & w_{12} \\ w_{12}^T & w_{22} \end{pmatrix} \begin{pmatrix} \Theta_{11} & \theta_{12} \\ \theta_{12}^T & \theta_{22} \end{pmatrix} = \begin{pmatrix} W_{11}\Theta_{11} + w_{12}\theta_{12}^T & W_{11}\theta_{12} + \theta_{22} w_{12} \\ w_{12}^T \Theta_{11} + w_{22}\theta_{12}^T & w_{12}^T \theta_{12} + w_{22}\theta_{22} \end{pmatrix} = \begin{pmatrix} I & 0 \\ 0^T & 1 \end{pmatrix} \quad (4.37)$$

so that

$$W_{11}\theta_{12} + \theta_{22} w_{12} = 0$$

• The resulting $L_1$-regularized problem in β:

$$\frac{\partial}{\partial\beta} \left( \frac{1}{2} \left\| W_{11}^{1/2} \beta - b \right\|^2 + \rho\|\beta\|_1 \right) = 0 \quad (4.35)$$

where $\beta = W_{11}^{-1} w_{12} \in \mathbb{R}^{|V|-1}$ and $b = W_{11}^{-1/2} s_{12}$.

• Define $\beta \equiv W_{11}^{-1} w_{12}$ and $b \equiv W_{11}^{-1/2} s_{12}$. From $W_{11}\theta_{12} + \theta_{22} w_{12} = 0$:

$$\theta_{12} = -\theta_{22} W_{11}^{-1} w_{12} = -\theta_{22}\,\beta \quad (4.38)$$

• Substituting $w_{12} = W_{11}\beta$ into (4.36):

$$W_{11}\beta - s_{12} - \rho\,\gamma_{12} = 0 \quad (4.39)$$

• Since Θ is positive definite, $\theta_{22} > 0$, hence $\mathrm{sign}(\theta_{12}) = -\mathrm{sign}(W_{11}^{-1} w_{12}) = -\mathrm{sign}(\beta)$. The left-hand side of (4.39) is therefore the subgradient of an $L_1$-regularized quadratic in β:

$$\begin{aligned} W_{11}\beta - s_{12} - \rho\,\gamma_{12} &= \frac{\partial}{\partial\beta} \left( \frac{1}{2} \beta^T W_{11} \beta - \beta^T s_{12} + \rho\|\beta\|_1 \right) \\ &= \frac{\partial}{\partial\beta} \left( \frac{1}{2} \left\| W_{11}^{1/2} \beta - W_{11}^{-1/2} s_{12} \right\|^2 - \frac{1}{2} s_{12}^T W_{11}^{-1} s_{12} + \rho\|\beta\|_1 \right) \\ &= \frac{\partial}{\partial\beta} \left( \frac{1}{2} \left\| W_{11}^{1/2} \beta - b \right\|^2 + \rho\|\beta\|_1 \right) = 0 \quad (4.40) \end{aligned}$$

which is (4.35). Solving this lasso problem for β gives $w_{12} = W_{11}\beta$, the corresponding block of the estimate W of Σ.

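• GL updates W one row and column at a time (Mazumder and Hastie, 2012)
– Initialize $W = S + \rho I$
– Partition W into $W_{11}$, $w_{12}$, $w_{12}^T$, $w_{22}$, and solve the lasso problem (11) with $b = W_{11}^{-1/2} s_{12}$ to obtain $\hat\beta$
– Update $\hat{w}_{12} = W_{11} \hat\beta$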

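$$Q(\Theta \mid \Theta^{old}) = \int \ln p(x_u, y \mid \Theta)\; p(x_u \mid y, \Theta^{old})\; dx_u$$

– $\Theta \leftarrow \Theta^{old}$

$$\Theta^* = \arg\max_{\Theta}\; \log\det\Theta - \mathrm{tr}(S\Theta) - \rho\|\Theta\|_1, \qquad \|\Theta\|_1 = \sum_{i,j=1}^{|V|} |\Theta_{ij}|$$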

– $2 \times 1183 + 1183 \times 1182 / 2 \approx 700{,}000$

[Figure: map with legend classes 0-5%, 5-10%, 10-50%, 50-100%]

[Figures 5–4 to 5–7: histograms of frequency against speed (km/h), with speed bins from 0 to 100 km/h]

[Figures: speed maps with legend classes 0-20 km/h, 20-40 km/h, 40-60 km/h, 60-80 km/h, and over 80 km/h]

[Figure: computation time (hours) against the regularization parameter ρ = 0.1, 0.05, 0.01, 0.005, 0.001]
GGM: 25 hours 42 minutes

[Figure: map with legend classes over 0.1, 0.1 to 0.05, -0.01 to -0.02, and under -0.02]

•
1.
2. GGM
Graphical Lasso
3. EM
GGM, GL EM
4.
5. GL
•
–
– 83

1. Kataoka, S., Yasuda, M., Furtlehner, C., and Tanaka, K.: Traffic data reconstruction based on Markov random field modeling, Inverse Problems, 30, 025003, 2014.
2. Friedman, J., Hastie, T., and Tibshirani, R.: Sparse inverse covariance estimation with the graphical lasso, Biostatistics, 9, 3, pp. 432-441, 2008.
3. Mazumder, R., and Hastie, T.: The graphical lasso: New insights and alternatives, Electronic Journal of Statistics, 6, pp. 2125-2149, 2012.
4. Dempster, A. P., Laird, N. M., and Rubin, D. B.: Maximum Likelihood from Incomplete Data via the EM Algorithm, Journal of the Royal Statistical Society. Series B (Methodological), 39, 1, pp. 1-38, 1977.
5. , , , :
, 12
ITS 2014 Peer-Review Proceedings, CD-ROM,
2014.


• Rish and Grabarnik, Sparse Modeling: Theory, Algorithms, and Applications, CRC Press, 2014.
• Eldar and Kutyniok, Compressed Sensing: Theory and Applications, Cambridge University Press, 2012.

• Cover, T. M., Van Campenhout, J. M. (1977). On the possible orderings in the measurement selection
problem. IEEE Transactions on Systems, Man, and Cybernetics, 7(9), 657-661.
• Bengio, Y., Courville, A., Vincent, P. (2013). Representation learning: A review and new perspectives.
IEEE transactions on pattern analysis and machine intelligence, 35(8), 1798-1828.
• Tomioka, R., Sugiyama, M. (2009). Dual-augmented Lagrangian method for efficient sparse
reconstruction. IEEE Signal Processing Letters, 16(12), 1067-1070.
• Eckstein, J., Bertsekas, D. P. (1992). On the Douglas-Rachford splitting method and the proximal point algorithm for maximal monotone operators. Mathematical Programming, 55(1), 293-318.
• Boyd, S., Parikh, N., Chu, E., Peleato, B., Eckstein, J. (2011). Distributed optimization and statistical
learning via the alternating direction method of multipliers. Foundations and Trends® in Machine
Learning, 3(1), 1-122.
• Fu, W. J. (1998). Penalized regressions: the bridge versus the lasso. Journal of computational and
graphical statistics, 7(3), 397-416.
• Daubechies, I., Defrise, M., De Mol, C. (2004). An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on Pure and Applied Mathematics, 57(11), 1413-1457.
• Friedman, J., Hastie, T., Höfling, H., Tibshirani, R. (2007). Pathwise coordinate optimization. The
Annals of Applied Statistics, 1(2), 302-332.
• Wu, T. T., Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The
Annals of Applied Statistics, 224-244.
• Beck, A., Tetruashvili, L. (2013). On the convergence of block coordinate descent type methods.
SIAM journal on Optimization, 23(4), 2037-2060.
