QMC: Operator Splitting Workshop, Projective Splitting with Forward Steps and Greedy Activation - Jonathan Eckstein, Mar 22, 2018
1. March 2018 1 of 22
Projective Splitting with Forward Steps
and Greedy Activation
Jonathan Eckstein
Rutgers University, New Jersey, USA
Joint work with
Patrick Johnstone (my postdoc)
Rutgers University, New Jersey, USA
Based on earlier work with
Patrick Combettes, Benar F. Svaiter
Funded in part by US National Science
Foundation Grant CCF-1617617
2. March 2018 2 of 22
Convex/Monotone Problem Setting
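• $\mathcal{H}_0, \mathcal{H}_1, \ldots, \mathcal{H}_n$ are real Hilbert spaces
• $G_i : \mathcal{H}_0 \to \mathcal{H}_i$ is a bounded/continuous linear operator $\forall i \in 1..n$
• $T_i : \mathcal{H}_i \rightrightarrows \mathcal{H}_i$ is a maximal monotone operator $\forall i \in 1..n$
Find
$$x \in \mathcal{H}_0 : \quad 0 \in \sum_{i=1}^{n} G_i^* T_i(G_i x)$$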
Generalization of
$$\min_{x \in \mathcal{H}_0} \sum_{i=1}^{n} f_i(G_i x)$$
where $f_i : \mathcal{H}_i \to \mathbb{R} \cup \{+\infty\}$ are closed proper convex $\forall i \in 1..n$
3. March 2018 3 of 22
For this Short Talk, a Simplification
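• $\mathcal{H}_0 = \mathcal{H}_1 = \cdots = \mathcal{H}_n = \mathcal{H}$
• $G_i = \mathrm{Id}$ $\forall i \in 1..n$
Find
$$x \in \mathcal{H} : \quad 0 \in \sum_{i=1}^{n} T_i(x)$$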
which generalizes
$$\min_{x \in \mathcal{H}} \sum_{i=1}^{n} f_i(x)$$
4. March 2018 4 of 22
The Kuhn-Tucker Set and Fejér Projection Algorithms
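$$\mathcal{S} = \Big\{ (z, w_1, \ldots, w_{n-1}) \in \mathcal{H}^n \;\Big|\; w_i \in T_i(z)\ \forall i \in 1..n-1,\ \ -\sum_{i=1}^{n-1} w_i \in T_n(z) \Big\}$$
• $z$ solves the inclusion $\Leftrightarrow \exists\, w_1, \ldots, w_{n-1} \in \mathcal{H} : (z, w_1, \ldots, w_{n-1}) \in \mathcal{S}$
• $\mathcal{S}$ is a closed convex set
• We will use a separating hyperplane projection algorithm to
try to (weakly) converge to a point in $\mathcal{S}$
• Fejér monotone: non-increasing distance to all points in $\mathcal{S}$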
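[Figure: the hyperplane $H_k = \{ p \mid \varphi_k(p) = 0 \}$, with $\varphi_k$ affine, separates the current iterate $p^k$, where $\varphi_k(p^k) > 0$, from $\mathcal{S}$, where $\varphi_k(p) \le 0\ \forall p \in \mathcal{S}$; the next iterate $p^{k+1}$ is obtained from $p^k$ by projection.]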
5. March 2018 5 of 22
A Family of Separating Hyperplanes
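Given $(x_i, y_i) \in \operatorname{graph} T_i$ $\forall i \in 1..n$, define
$$\varphi(z, w_1, \ldots, w_{n-1}) = \sum_{i=1}^{n-1} \langle z - x_i,\, y_i - w_i \rangle + \Big\langle z - x_n,\, y_n + \sum_{i=1}^{n-1} w_i \Big\rangle$$
• $\varphi$ is an affine function on $\mathcal{H}^n$
(the quadratic terms cancel)
• $T_1, \ldots, T_n$ monotone $\Rightarrow \varphi(z, w_1, \ldots, w_{n-1}) \le 0 \ \ \forall (z, w_1, \ldots, w_{n-1}) \in \mathcal{S}$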
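The two properties of $\varphi$ stated on this slide can be checked numerically. The sketch below is only an illustration: it assumes hypothetical affine operators $T_i(x) = A_i x - b_i$ with symmetric positive definite $A_i$ (not part of the talk), builds a point of the Kuhn-Tucker set from the exact solution, and evaluates $\varphi$ there for arbitrary graph points. The helper name `phi` is made up for this example.

```python
import numpy as np

def phi(z, w, x, y):
    """Evaluate phi(z, w_1..w_{n-1}) for graph points (x_i, y_i), i = 1..n.

    w has shape (n-1, d); x and y have shape (n, d).
    """
    w_n = -w.sum(axis=0)              # implicit last dual variable
    w_full = np.vstack([w, w_n])      # shape (n, d)
    return float(np.sum((z - x) * (y - w_full)))

rng = np.random.default_rng(0)
d, n = 4, 3
# Assumption for illustration: T_i(x) = A_i x - b_i with A_i symmetric positive definite.
A = []
for _ in range(n):
    M = rng.standard_normal((d, d))
    A.append(M @ M.T + np.eye(d))
b = rng.standard_normal((n, d))

# A point of S: z* solves sum_i T_i(z) = 0, and w_i* = T_i(z*) for i < n.
z_star = np.linalg.solve(sum(A), b.sum(axis=0))
w_star = np.array([A[i] @ z_star - b[i] for i in range(n - 1)])

# Arbitrary points on the graphs of the T_i.
x = rng.standard_normal((n, d))
y = np.array([A[i] @ x[i] - b[i] for i in range(n)])

val_at_solution = phi(z_star, w_star, x, y)   # should not be positive
print(val_at_solution)
```

Because $\varphi$ is affine, its value at the midpoint of two points equals the average of its values at those points, which gives a second quick sanity check.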
6. March 2018 6 of 22
Constructing a Separating Hyperplane
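Given $p^k = (z^k, w_1^k, \ldots, w_{n-1}^k) \notin \mathcal{S}$, can we find $(x_i^k, y_i^k) \in \operatorname{graph} T_i$ such that
$$\varphi_k(z^k, w_1^k, \ldots, w_{n-1}^k) = \sum_{i=1}^{n-1} \langle z^k - x_i^k,\, y_i^k - w_i^k \rangle + \Big\langle z^k - x_n^k,\, y_n^k + \sum_{i=1}^{n-1} w_i^k \Big\rangle > 0 \ ?$$
Sufficient to solve the following for each $T_i$:
• Given maximal monotone $T : \mathcal{H} \rightrightarrows \mathcal{H}$ and $(z, w) \in (\mathcal{H} \times \mathcal{H}) \setminus \operatorname{graph} T$,
find $(x, y) \in \operatorname{graph} T$ such that
$\langle z - x,\, y - w \rangle > 0$ or equivalently $\langle x - z,\, y - w \rangle < 0$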
7. March 2018 7 of 22
Using a Proximal (Backward) Step ($i \in 1..n-1$)
• Take any $\rho > 0$. Then the proximal step finds the unique
$(x_i^k, y_i^k) \in \operatorname{graph} T_i$ such that $x_i^k + \rho y_i^k = z^k + \rho w_i^k$
• So
$$\rho (y_i^k - w_i^k) = z^k - x_i^k \ \Rightarrow\ \langle z^k - x_i^k,\, y_i^k - w_i^k \rangle = \tfrac{1}{\rho} \| z^k - x_i^k \|^2 \ge 0$$
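[Figure: graph of $T_i$ intersected by the line $x + \rho y = z^k + \rho w_i^k$, which passes through $(z^k, w_i^k)$ and meets the graph at $(x_i^k, y_i^k)$.]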
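As a concrete instance of the proximal step, the sketch below assumes a hypothetical affine operator $T(x) = Ax - b$ with $A$ symmetric positive semidefinite (an illustration only, not an operator from the talk). For such an operator the condition $x + \rho y = z + \rho w$ with $y = T(x)$ reduces to a linear solve, and the identity $\langle z - x, y - w \rangle = \tfrac{1}{\rho}\|z - x\|^2$ can be confirmed numerically.

```python
import numpy as np

def prox_step_affine(A, b, z, w, rho):
    """Return (x, y) in graph T with x + rho*y = z + rho*w, for T(x) = A x - b."""
    d = A.shape[0]
    # x + rho*(A x - b) = z + rho*w  <=>  (I + rho*A) x = z + rho*(w + b)
    x = np.linalg.solve(np.eye(d) + rho * A, z + rho * (w + b))
    y = A @ x - b
    return x, y

rng = np.random.default_rng(1)
d = 5
M = rng.standard_normal((d, d))
A = M @ M.T                      # symmetric PSD, so T is monotone
b = rng.standard_normal(d)
z, w = rng.standard_normal(d), rng.standard_normal(d)
rho = 0.7                        # any rho > 0 is allowed

x, y = prox_step_affine(A, b, z, w, rho)
lhs = float((z - x) @ (y - w))           # <z - x, y - w>
rhs = float((z - x) @ (z - x)) / rho     # (1/rho) * ||z - x||^2
print(lhs, rhs)
```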
8. March 2018 8 of 22
Using a Proximal (Backward) Step, Continued
• Defining $w_n^k = -\sum_{i=1}^{n-1} w_i^k$, the same thing works for $i = n$
• Adding up, $\varphi_k(z^k, w_1^k, \ldots, w_{n-1}^k) = \sum_{i=1}^{n} \langle z^k - x_i^k,\, y_i^k - w_i^k \rangle \ge 0$
• And if $\varphi_k(z^k, w_1^k, \ldots, w_{n-1}^k) = 0$, then $z^k = x_1^k = \cdots = x_n^k$ and $w_i^k = y_i^k\ \forall i$,
meaning that $(z^k, w_1^k, \ldots, w_{n-1}^k) \in \mathcal{S}$ since $(x_i^k, y_i^k) \in \operatorname{graph} T_i\ \forall i$
• So we strictly separate any $p^k = (z^k, w_1^k, \ldots, w_{n-1}^k) \in \mathcal{H}^n \setminus \mathcal{S}$ from $\mathcal{S}$
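[Figure: the hyperplane $H_k = \{ p \mid \varphi_k(p) = 0 \}$ separating $p^k$, where $\varphi_k(p^k) > 0$, from $\mathcal{S}$, where $\varphi_k(p) \le 0\ \forall p \in \mathcal{S}$.]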
9. March 2018 9 of 22
Algorithm Close to a Special Case of Eckstein and Svaiter 2009
Starting with an arbitrary $(z^0, w_1^0, \ldots, w_{n-1}^0) \in \mathcal{H}^n$:
For $k = 0, 1, 2, \ldots$
1. For $i = 1, \ldots, n$, compute
$$(x_i^k, y_i^k) = \operatorname{Prox}_{T_i}^{\rho_i^k}\big(z^k + \rho_i^k w_i^k\big)$$
(Decomposition Step) (parameters $\rho_i^k$ can vary with $i$ and $k$)
2. Define
$$\varphi_k(z, w_1, \ldots, w_{n-1}) = \sum_{i=1}^{n-1} \langle z - x_i^k,\, y_i^k - w_i \rangle + \Big\langle z - x_n^k,\, y_n^k + \sum_{i=1}^{n-1} w_i \Big\rangle$$
3. Compute $p^{k+1} = (z^{k+1}, w_1^{k+1}, \ldots, w_{n-1}^{k+1})$ by projecting
$p^k = (z^k, w_1^k, \ldots, w_{n-1}^k)$ onto the halfspace $\varphi_k(z, w_1, \ldots, w_{n-1}) \le 0$
(possibly with some overrelaxation) (Coordination Step)
Eckstein and Svaiter 2009 showed that the cuts $\varphi_k(z, w_1, \ldots, w_{n-1}) \le 0$
obtained this way (and generalizations) are sufficiently deep for
$\{z^k\}$ to converge (weakly) to a solution. For fixed $0 < \rho_{\min} \le \rho_{\max}$,
any choices of $\rho_i^k \in [\rho_{\min}, \rho_{\max}]$ are permitted.
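The three steps can be sketched in a few lines for a toy problem. This is a minimal illustration under stated assumptions, not the authors' implementation: the operators are hypothetical affine maps $T_i(x) = A_i x - b_i$ with symmetric positive definite $A_i$, the stepsize $\rho$ is fixed, the plain Euclidean norm on $\mathcal{H}^n$ is used, and there is no overrelaxation. The function name `projective_splitting` and all parameters are made up for the example. Since $\varphi_k$ is affine, its gradient is $\sum_i y_i^k$ in the $z$ block and $x_i^k - x_n^k$ in the $w_i$ block.

```python
import numpy as np

def projective_splitting(A, b, rho=1.0, iters=500):
    """Sketch of the prox-based iteration for T_i(x) = A[i] @ x - b[i]."""
    n, d = len(A), A[0].shape[0]
    z = np.zeros(d)
    w = np.zeros((n - 1, d))
    I = np.eye(d)
    for _ in range(iters):
        w_full = np.vstack([w, -w.sum(axis=0)])   # append w_n = -sum_{i<n} w_i
        # Step 1 (decomposition): x_i + rho*y_i = z + rho*w_i with y_i = T_i(x_i).
        x = np.array([np.linalg.solve(I + rho * A[i], z + rho * (w_full[i] + b[i]))
                      for i in range(n)])
        y = np.array([A[i] @ x[i] - b[i] for i in range(n)])
        # Step 2: value and gradient of the affine function phi_k at p^k.
        phi_val = np.sum((z - x) * (y - w_full))
        grad_z = y.sum(axis=0)
        grad_w = x[:-1] - x[-1]
        norm2 = grad_z @ grad_z + np.sum(grad_w * grad_w)
        if phi_val <= 0 or norm2 == 0:
            break                                  # p^k is not strictly separated
        # Step 3 (coordination): project p^k onto the halfspace {phi_k <= 0}.
        step = phi_val / norm2
        z = z - step * grad_z
        w = w - step * grad_w
    return z, w

rng = np.random.default_rng(3)
n, d = 3, 4
A = []
for _ in range(n):
    M = rng.standard_normal((d, d))
    A.append(M @ M.T / d + np.eye(d))
b = rng.standard_normal((n, d))

z, w = projective_splitting(A, b)

# Reference point of S, available here only because the toy problem is linear.
z_star = np.linalg.solve(sum(A), b.sum(axis=0))
w_star = np.array([A[i] @ z_star - b[i] for i in range(n - 1)])
dist0 = np.sqrt(z_star @ z_star + np.sum(w_star * w_star))   # distance from the start (0, 0)
dist = np.sqrt((z - z_star) @ (z - z_star) + np.sum((w - w_star) ** 2))
print(dist0, dist)
```

The final distance to the reference point should be no larger than the initial one, which is the Fejér monotonicity property the method relies on.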
10. March 2018 10 of 22
More on This Class of Algorithm
• (Overrelaxed) projection:
$$p^{k+1} = p^k - \beta_k \frac{\max\{0,\ \varphi_k(p^k)\}}{\|\nabla \varphi_k\|^2}\, \nabla \varphi_k$$
• Helpful to use a scaled norm to adjust primal/dual weighting
Further developments:
• Alotaibi, Combettes & Shahzad 2013: include a linear
mapping $G$ and solve $0 \in T_1(z) + G^* T_2(Gz)$ with proximal steps on
$T_1$ and $T_2$ (not $G^* T_2 G$)
• Combettes and Eckstein 2016: block iterative and
asynchronous versions with $n \ge 2$ operators
o Block iterative: at each iteration, process only a subset of
blocks $i$, keep remaining $(x_i^k, y_i^k) \in \operatorname{graph} T_i$ unchanged
o Asynchronous: proximal operations can use (boundedly)
outdated information, allowing asynchronous parallel
operation
11. March 2018 11 of 22
A Recently Solved Challenge
Within the context of this kind of projective splitting algorithm:
• Suppose $T_i$ is Lipschitz continuous with constant $L_i$
• Do we really have to perform a proximal step on such an
operator? Can’t we use forward steps instead?
o There are a variety of splitting algorithms that use forward
$x - \rho T_i(x)$ steps on Lipschitz operators...
o ...with the stepsize $\rho$ typically bounded by something
proportional to $1/L_i$
Answer (from Patrick Johnstone):
• For a Lipschitz operator, you can substitute two forward steps
for a proximal step
12. March 2018 12 of 22
Using Two Forward Steps
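$$\begin{aligned}
\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle
&= \langle z^k - x_i^k,\, T_i z^k - w_i^k \rangle - \langle z^k - x_i^k,\, T_i z^k - y_i^k \rangle \\
&= \big\langle z^k - x_i^k,\, \tfrac{1}{\rho}(z^k - x_i^k) \big\rangle - \langle z^k - x_i^k,\, T_i z^k - T_i x_i^k \rangle \\
&\ge \tfrac{1}{\rho} \| z^k - x_i^k \|^2 - L_i \| z^k - x_i^k \|^2 = \big( \tfrac{1}{\rho} - L_i \big) \| z^k - x_i^k \|^2
\end{aligned}$$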
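[Figure: graph of $T_i$ showing the points $(z^k, w_i^k)$, $(z^k, T_i z^k)$, $(x_i^k, w_i^k)$, and $(x_i^k, y_i^k) = (x_i^k, T_i x_i^k)$, with a line labeled with slope $1/\rho$.]
$x_i^k = z^k - \rho\,(T_i z^k - w_i^k)$, then $y_i^k = T_i x_i^k$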
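The two-forward-step update and the lower bound derived on this slide can be checked on a small example. The sketch assumes a hypothetical affine operator $T(x) = Ax - b$ with symmetric positive semidefinite $A$, whose Lipschitz constant is the spectral norm of $A$; the helper name `two_forward_steps` is invented for this illustration.

```python
import numpy as np

def two_forward_steps(T, z, w, rho):
    """x = z - rho*(T(z) - w), then y = T(x); uses two evaluations of T."""
    x = z - rho * (T(z) - w)
    y = T(x)
    return x, y

rng = np.random.default_rng(2)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T                       # symmetric PSD: T is monotone and Lipschitz
b = rng.standard_normal(d)
T = lambda v: A @ v - b
L = np.linalg.norm(A, 2)          # spectral norm, the Lipschitz constant of T
rho = 0.5 / L                     # any stepsize below 1/L

z, w = rng.standard_normal(d), rng.standard_normal(d)
x, y = two_forward_steps(T, z, w, rho)

inner = float((z - x) @ (y - w))                        # <z - x, y - w>
bound = (1.0 / rho - L) * float((z - x) @ (z - x))      # (1/rho - L) * ||z - x||^2
print(inner, bound)
```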
13. March 2018 13 of 22
Using Two Forward Steps, Continued
• So if $1/\rho > L_i \Leftrightarrow \rho < 1/L_i$, we get a valid step
• And it turns out that all the convergence theory continues to
go through, including block iterations and asynchronicity
Variations
• If $L_i$ is unknown, instead possible to pick some $\Delta > 0$ and
backtrack on $\rho$ until
$$\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle \ge \Delta\, \| z^k - x_i^k \|^2$$
Will eventually occur for small enough $\rho$ if $T_i$ is Lipschitz:
$t + 1$ operator evaluations, where $t$ is # of backtrack steps
• If $T_i$ is affine, can just solve for $\rho$ in 2 total evaluations, or…
• ...similarly, solve for $\rho$ maximizing $\langle z^k - x_i^k,\, y_i^k - w_i^k \rangle$
• The convergence theory still holds with all these techniques
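A backtracking variant of the forward step might look like the sketch below. The initial stepsize `rho0`, the shrink factor, and the cap on backtracks are assumptions chosen for illustration; the talk only specifies the acceptance test with $\Delta > 0$. The value $T(z)$ is computed once and reused, and each trial stepsize costs one further evaluation of $T$.

```python
import numpy as np

def backtracked_forward_step(T, z, w, rho0=1.0, delta=0.1, shrink=0.5, max_backtracks=60):
    """Shrink rho until <z - x, y - w> >= delta * ||z - x||^2.

    Returns (x, y, rho, t), where t is the number of backtrack steps taken.
    """
    Tz = T(z)                              # evaluated once and reused
    rho = rho0
    for t in range(max_backtracks + 1):
        x = z - rho * (Tz - w)
        y = T(x)                           # one new evaluation per trial stepsize
        if (z - x) @ (y - w) >= delta * ((z - x) @ (z - x)):
            return x, y, rho, t
        rho *= shrink
    raise RuntimeError("no acceptable stepsize found")

rng = np.random.default_rng(4)
d = 6
M = rng.standard_normal((d, d))
A = M @ M.T                                # hypothetical Lipschitz monotone operator
b = rng.standard_normal(d)
T = lambda v: A @ v - b

z, w = rng.standard_normal(d), rng.standard_normal(d)
x, y, rho, t = backtracked_forward_step(T, z, w)
print(rho, t)
```

For a Lipschitz operator the test is guaranteed to pass once $1/\rho - L_i \ge \Delta$, so the loop ends after finitely many halvings.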
14. March 2018 14 of 22
“Greedy” Activation Heuristic
• If we don’t overrelax, iteration $k-1$ typically leaves us with
$$\varphi_{k-1}(z^k, w_1^k, \ldots, w_{n-1}^k) = \sum_{i=1}^{n} \langle z^k - x_i^{k-1},\, y_i^{k-1} - w_i^k \rangle = 0$$
• If we find an $i$ for which $\langle z^k - x_i^{k-1},\, y_i^{k-1} - w_i^k \rangle$ is negative, we can
increase it to at least 0 and immediately cut off the current
iterate
o Works with either a proximal step or our two-forward-step
technique
• Heuristic: give priority to processing $i$ for which
$\langle z^k - x_i^{k-1},\, y_i^{k-1} - w_i^k \rangle$ is the most negative
• This does not really maximize the distance to the separator,
but seems to be a useful proxy
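The greedy selection rule amounts to an argmin over the per-block terms of $\varphi_{k-1}$. A minimal sketch, with a made-up helper name `greedy_block` and hand-picked illustrative data whose terms sum to zero:

```python
import numpy as np

def greedy_block(z, w_full, x_prev, y_prev):
    """Return the index i minimizing <z - x_i, y_i - w_i>, plus all the terms.

    w_full, x_prev, y_prev have shape (n, d); w_full includes w_n.
    """
    terms = np.sum((z - x_prev) * (y_prev - w_full), axis=1)
    return int(np.argmin(terms)), terms

# Illustrative data (n = 3 blocks, d = 2); the rows of w_full sum to zero.
z = np.array([0.0, 0.0])
x_prev = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]])
y_prev = np.array([[2.0, 1.0], [2.0, 0.0], [0.0, 0.0]])
w_full = np.array([[0.0, 0.0], [0.0, 1.0], [0.0, -1.0]])

i, terms = greedy_block(z, w_full, x_prev, y_prev)
print(i, terms)   # block 0 has the most negative term: terms are [-2, 1, 1]
```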
15. March 2018 15 of 22
Some Very Preliminary Computational Tests: LASSO
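LASSO problems:
$$\min_{x \in \mathbb{R}^d} \Big\{ \tfrac{1}{2} \| Qx - b \|^2 + \lambda \| x \|_1 \Big\}$$
Partition $Q$ into $r$ blocks of rows, set $n = r + 1$
$$\min_{x \in \mathbb{R}^d} \Big\{ \sum_{i=1}^{r} \tfrac{1}{2} \| Q_i x - b_i \|^2 + \lambda \| x \|_1 \Big\}$$
So we can set
$$T_i(x) = Q_i^{\top} (Q_i x - b_i) \ \ \forall i \in 1..n-1, \qquad T_n = \partial\big( \lambda \| \cdot \|_1 \big)$$
• At each iteration, process blocks $\{i, n\}$, where $i \in 1..n-1$ is
selected randomly or greedily; forward steps use $\Delta$ technique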
• Did some primal-dual scaling (simple norm change)
• Also simulate random asynchronicity delays
• Measure the number of “Q-equivalent” matrix multiplies
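The operator split for LASSO can be set up as in the sketch below. It is an illustration only: the function name `make_lasso_operators` and the use of `np.array_split` for the row partition are assumptions, and the scaling, delay simulation, and block selection used in the experiments are not reproduced. The proximal step for $T_n$ is written in the form used earlier in the talk, finding $(x, y)$ with $x + \rho y = z + \rho w$ and $y \in \lambda\,\partial\|x\|_1$, which is soft-thresholding.

```python
import numpy as np

def make_lasso_operators(Q, b, r, lam):
    """Build T_1..T_r (smooth row blocks) and the proximal step for T_n."""
    Q_blocks = np.array_split(Q, r, axis=0)
    b_blocks = np.array_split(b, r)

    def make_T(Qi, bi):
        # T_i(x) = Q_i^T (Q_i x - b_i), the gradient of (1/2)||Q_i x - b_i||^2
        return lambda x: Qi.T @ (Qi @ x - bi)

    smooth = [make_T(Qi, bi) for Qi, bi in zip(Q_blocks, b_blocks)]

    def prox_l1(z, w, rho):
        """Find (x, y) with x + rho*y = z + rho*w and y in lam * subdiff ||x||_1."""
        v = z + rho * w
        x = np.sign(v) * np.maximum(np.abs(v) - rho * lam, 0.0)  # soft-threshold
        y = (v - x) / rho
        return x, y

    return smooth, prox_l1

rng = np.random.default_rng(5)
m, d, r, lam = 12, 5, 3, 0.3
Q = rng.standard_normal((m, d))
b = rng.standard_normal(m)
smooth, prox_l1 = make_lasso_operators(Q, b, r, lam)

x0 = rng.standard_normal(d)
total = sum(T(x0) for T in smooth)        # equals Q^T (Q x0 - b)
xp, yp = prox_l1(x0, rng.standard_normal(d), rho=0.8)
print(np.abs(yp).max())                   # never exceeds lam
```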
16. March 2018 16 of 22
Preliminary Test Results: Blog Data
Legend: $(r, D)$, where $D$ is max delay, G = greedy & no delay
17. March 2018 17 of 22
Preliminary Test Results: Crime Data
18. March 2018 18 of 22
Preliminary Test Results: Randomly Generated Data
19. March 2018 19 of 22
Observations
• Projective splitting seems to have some promise as a way to
build efficient parallel algorithms for large-scale problems
• Breaking up the loss function term into multiple blocks seems
to speed up the projective splitting methods – an unusual
property for decomposition algorithms
• Greedy block activation looks useful
• It seems helpful to use forward steps for affine operators
More Coming Soon
• Convergence rate analyses (see also Machado 2017)
20. March 2018 20 of 22
Big Open Question
• What is a “killer app” for projective splitting?
21. March 2018 21 of 22
Some More Open Questions
Adaptive stepsizes: how might we fully exploit all the allowed
parameter variability?
• Projective splitting has existed for nearly a decade, but we
still don’t know how to use all the extra parameter variability
it allows
• From iteration to iteration
Related question:
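• Given maximal monotone $T : \mathcal{H} \rightrightarrows \mathcal{H}$ and $(z, w) \in (\mathcal{H} \times \mathcal{H}) \setminus \operatorname{graph} T$,
find $(x, y) \in \operatorname{graph} T$ such that
$\langle x - z,\, y - w \rangle$ is minimized (or at least a “large” negative number)
• Better yet, minimize
$$\frac{\langle x - z,\, y - w \rangle}{\| x \|^2 + \gamma \| y \|^2}$$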
22. March 2018 22 of 22
References
• Patrick R. Johnstone and Jonathan Eckstein. “Projective
Splitting with Forward Steps: Asynchronous and Block-Iterative
Operator Splitting”. Optimization Online and ArXiv, released
March 2018.