1. Beyond Convexity –
Submodularity in Machine Learning
Andreas Krause, Carlos Guestrin
Carnegie Mellon University
International Conference on Machine Learning | July 5, 2008
2. Acknowledgements
Thanks for slides and material to Mukund Narasimhan,
Jure Leskovec and Manuel Reyes Gomez
MATLAB Toolbox and details for references available at
http://www.submodularity.org
Algorithms implemented in the MATLAB toolbox
3. Optimization in Machine Learning
Classify + from – by finding a separating hyperplane (parameters w)
Which one should we choose?
Define loss L(w) = “1/size of margin”
Solve for best vector w* = argmin_w L(w)
Key observation: Many problems in ML are convex: no local minima!
4. Feature selection
Given random variables Y, X1, … Xn
Want to predict Y from a subset X_A = (X_{i1},…,X_{ik})
Naïve Bayes model: Y = “Sick”; X1 = “Fever”, X2 = “Rash”, X3 = “Male”
Want the k most informative features:
A* = argmax IG(X_A; Y) s.t. |A| ≤ k
where IG(X_A; Y) = H(Y) – H(Y | X_A)
(uncertainty before knowing X_A minus uncertainty after knowing X_A)
5. Factoring distributions
Given random variables X1,…,Xn
Partition the variables V into sets A and V\A that are as independent as possible
Formally: Want
A* = argmin_A I(X_A; X_{V\A}) s.t. 0 < |A| < n
where I(X_A; X_B) = H(X_B) – H(X_B | X_A)
Fundamental building block in structure learning [Narasimhan & Bilmes, UAI ’04]
6. Combinatorial problems in ML
Given a (finite) set V and a function F: 2^V → R, want
A* = argmin F(A) s.t. some constraints on A
Solving combinatorial problems:
Mixed integer programming?
Often difficult to scale to large problems
Relaxations? (e.g., L1 regularization, etc.)
Not clear when they work
This talk:
Fully combinatorial algorithms (spanning tree, matching, …)
Exploit problem structure to get guarantees about solution!
6
7. Example: Greedy algorithm for feature selection
Given: finite set V of features, utility function F(A) = IG(X_A; Y)
Want: A* ⊆ V with |A| ≤ k maximizing F(A) (NP-hard!)
(Naïve Bayes model: Y = “Sick”; X1 = “Fever”, X2 = “Rash”, X3 = “Male”)
Greedy algorithm (sketched in Python below):
Start with A = ∅
For i = 1 to k
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
How well can this simple heuristic do?
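A minimal Python sketch of this greedy selection (a generic stand-in, not the tutorial's MATLAB toolbox; the toy coverage utility below is an illustrative assumption used in place of the information-gain oracle F):

def greedy_max(F, V, k):
    """Greedily pick k elements of V, each time maximizing the gain of F."""
    A = frozenset()
    for _ in range(k):
        gains = {s: F(A | {s}) - F(A) for s in V - A}   # marginal gains
        s_best = max(gains, key=gains.get)
        A = A | {s_best}
    return A

# toy stand-in utility: each feature "covers" a made-up set of symptoms
covers = {"fever": {"flu", "cold"}, "rash": {"measles"}, "male": set()}
F = lambda A: len(set().union(*[covers[s] for s in A])) if A else 0
print(greedy_max(F, frozenset(covers), k=2))   # e.g. {'fever', 'rash'}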
8. Key property: Diminishing returns
Selection A = {} vs. selection B = {X2,X3} (X2 = “Rash”, X3 = “Male”), predicting Y = “Sick”
Adding the new feature X1 = “Fever” to A will help a lot; adding X1 to B doesn’t help much.
Theorem [Krause, Guestrin UAI ‘05]: Information gain F(A) in Naïve Bayes models is submodular!
Submodularity (diminishing returns): adding s to the small set A gives a large improvement,
adding s to the large set B gives a small improvement:
For A ⊆ B, F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
9. Why is submodularity useful?
Theorem [Nemhauser et al ‘78]
The greedy maximization algorithm returns A_greedy with:
F(A_greedy) ≥ (1 - 1/e) max_{|A| ≤ k} F(A)   (~63% of optimal)
Greedy algorithm gives a near-optimal solution!
More details and exact statement later
For info-gain: Guarantees best possible unless P = NP! [Krause, Guestrin UAI ’05]
10. Submodularity in Machine Learning
In this tutorial we will see that many ML problems are
submodular, i.e., for F submodular require:
Minimization: A* = argmin F(A)
Structure learning (A* = argmin I(X_A; X_{V\A}))
Clustering
MAP inference in Markov Random Fields
…
Maximization: A* = argmax F(A)
Feature selection
Active learning
Ranking
…
10
11. Tutorial Overview
Examples and properties of submodular functions
Submodularity and convexity
Minimizing submodular functions
Maximizing submodular functions
Research directions, …
LOTS of applications to Machine Learning!!
11
13. Set functions
Finite set V = {1,2,…,n}
Function F: 2^V → R
Will always assume F(∅) = 0 (w.l.o.g.)
Assume a black box that can evaluate F for any input A
Approximate (noisy) evaluation of F is ok (e.g., [37])
Example: F(A) = IG(X_A; Y) = H(Y) – H(Y | X_A)
  = ∑_{y,x_A} P(y, x_A) [log P(y | x_A) – log P(y)]
Naïve Bayes example (Y = “Sick”; X1 = “Fever”, X2 = “Rash”, X3 = “Male”):
F({X1,X2}) = 0.9, F({X2,X3}) = 0.5
14. Submodular set functions
Set function F on V is called submodular if
For all A,B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
Equivalent diminishing returns characterization:
For A ⊆ B, s ∉ B: F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
(adding s to the small set A gives a large improvement, adding s to the large set B a small one)
15. Submodularity and supermodularity
Set function F on V is called submodular if
1) For all A,B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
2) For all A ⊆ B, s ∉ B: F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
F is called supermodular if –F is submodular
F is called modular if F is both sub- and supermodular
For modular (“additive”) F: F(A) = ∑_{i∈A} w(i)
16. Example: Set cover
Place sensors in a building: want to cover the floorplan with discs
Possible locations V; each node predicts values of positions within some radius
For A ⊆ V: F(A) = “area covered by sensors placed at A”
Formally:
W finite set, collection of n subsets S_i ⊆ W
For A ⊆ V = {1,…,n} define F(A) = |∪_{i∈A} S_i|
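A small sketch of this set-cover objective in Python (the coverage sets below are made-up stand-ins for the sensing discs):

# Set cover utility: F(A) = |union of the subsets S_i for i in A|
coverage = {
    1: {"officeA", "officeB"},
    2: {"officeB", "kitchen"},
    3: {"lab", "server"},
}

def F_cover(A):
    """Number of cells covered by the sensors in A."""
    covered = set()
    for i in A:
        covered |= coverage[i]
    return len(covered)

# diminishing returns: adding sensor 2 helps {1} less than it helps {}
assert F_cover({1, 2}) - F_cover({1}) <= F_cover({2}) - F_cover(set())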
17. Set cover is submodular
A = {S1,S2}, B = {S1,S2,S3,S4}
Adding a new disc S’ to the small collection A covers at least as much new area as adding S’ to the larger collection B:
F(A ∪ {S’}) – F(A) ≥ F(B ∪ {S’}) – F(B)
18. Example: Mutual information
Given random variables X1,…,Xn
F(A) = I(X_A; X_{V\A}) = H(X_{V\A}) – H(X_{V\A} | X_A)
Lemma: Mutual information F(A) is submodular
F(A ∪ {s}) – F(A) = H(X_s | X_A) – H(X_s | X_{V\(A∪{s})})
The first term is nonincreasing in A (A ⊆ B ⇒ H(X_s | X_A) ≥ H(X_s | X_B)); the second term is nondecreasing in A
⇒ δ_s(A) = F(A ∪ {s}) – F(A) is monotonically nonincreasing ⇒ F submodular
19. Example: Influence in social networks
[Kempe, Kleinberg, Tardos KDD ’03]
Social network over Alice, Bob, Charlie, Dorothy, Eric, Fiona; each edge is labeled with the probability of influencing (values between 0.2 and 0.5)
Who should get free cell phones?
V = {Alice,Bob,Charlie,Dorothy,Eric,Fiona}
F(A) = Expected number of people influenced when targeting A
19
20. Influence in social networks is submodular
[Kempe, Kleinberg, Tardos KDD ’03]
(Same network as before, with edge probabilities of influencing.)
Key idea: Flip the coins c in advance to obtain the “live” edges
F_c(A) = people influenced under outcome c (a set cover function!)
F(A) = ∑_c P(c) F_c(A) is submodular as well!
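A small Monte Carlo sketch of this live-edge idea in Python (the graph and probabilities below are illustrative stand-ins, not the slide's exact figure): flip the coins up front, and each outcome reduces to counting reachable nodes, a set-cover-style quantity.

import random

# illustrative influence probabilities (made up)
edges = {("Alice", "Bob"): 0.5, ("Bob", "Charlie"): 0.5,
         ("Alice", "Dorothy"): 0.2, ("Dorothy", "Eric"): 0.4,
         ("Eric", "Fiona"): 0.5}

def reachable(live_edges, A):
    """Nodes reachable from the target set A along the sampled live edges."""
    seen, stack = set(A), list(A)
    while stack:
        u = stack.pop()
        for (a, b) in live_edges:
            if a == u and b not in seen:
                seen.add(b)
                stack.append(b)
    return seen

def influence(A, n_samples=2000):
    """Monte Carlo estimate of F(A) = E[# of people influenced]."""
    total = 0
    for _ in range(n_samples):
        live = [e for e, p in edges.items() if random.random() < p]
        total += len(reachable(live, A))
    return total / n_samples

print(influence({"Alice"}))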
21. Closedness properties
F1,…,Fm submodular functions on V and λ1,…,λm > 0
Then: F(A) = ∑_i λ_i F_i(A) is submodular!
Submodularity is closed under nonnegative linear combinations!
Extremely useful fact, e.g.:
F_θ(A) submodular ⇒ ∑_θ P(θ) F_θ(A) submodular!
Multicriterion optimization: F1,…,Fm submodular, λ_i ≥ 0 ⇒ ∑_i λ_i F_i(A) submodular
21
22. Submodularity and Concavity
Suppose g: N → R and F(A) = g(|A|)
Then F(A) is submodular if and only if g is concave!
E.g., g could say “buying in bulk is cheaper”
23. Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular.
Is F(A) = max(F1(A), F2(A)) submodular?
max(F1,F2) is not submodular in general!
24. Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?
A      F1(A)  F2(A)  F(A)
∅      0      0      0
{a}    1      0      0
{b}    0      1      0
{a,b}  1      1      1
Here F({b}) – F(∅) = 0 < F({a,b}) – F({a}) = 1
min(F1,F2) is not submodular in general!
But stay tuned – we’ll address min_i F_i later!
25. Duality
For F submodular on V, let G(A) = F(V) – F(V\A)
G is supermodular and called the dual of F
Details about properties in [Fujishige ’91]
26. Tutorial Overview
Examples and properties of submodular functions
Many problems submodular (mutual information, influence, …)
SFs closed under positive linear combinations;
not under min, max
Submodularity and convexity
Minimizing submodular functions
Maximizing submodular functions
Extensions and research directions
26
28. Submodularity and convexity
For V = {1,…,n} and A ⊆ V, let
w^A = (w^A_1,…,w^A_n) with w^A_i = 1 if i ∈ A, 0 otherwise
Key result [Lovász ’83]: Every submodular function F induces a function g on R^n_+ such that
F(A) = g(w^A) for all A ⊆ V
g(w) is convex
min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n
Let’s see how one can define g(w)
29. The submodular polyhedron PF
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑_{i∈A} x_i
Example: V = {a,b} with
A      F(A)
∅      0
{a}    -1
{b}    2
{a,b}  0
P_F is then the region of (x_a, x_b) satisfying x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})
30. Lovasz extension
Claim: g(w) = max_{x ∈ P_F} wᵀx
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}
x_w = argmax_{x ∈ P_F} wᵀx,  g(w) = wᵀ x_w
Evaluating g(w) requires solving a linear program with exponentially many constraints
31. Evaluating the Lovász extension
g(w) = max_{x ∈ P_F} wᵀx
P_F = {x ∈ R^n : x(A) ≤ F(A) for all A ⊆ V}
Theorem [Edmonds ’71, Lovász ‘83]:
For any given w, can get the optimal solution x_w to the LP using the following greedy algorithm:
Order V = {e1,…,en} so that w(e1) ≥ … ≥ w(en)
Let x_w(e_i) = F({e1,…,e_i}) – F({e1,…,e_{i-1}})
Then wᵀ x_w = g(w) = max_{x ∈ P_F} wᵀx
Sanity check: If w = w^A and A = {e1,…,ek}, then wᵀ x_w = ∑_{i=1}^k x_w(e_i) = F(A)
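A sketch of Edmonds' greedy rule in Python, reusing the small F from the polyhedron example two slides back:

def lovasz_extension(F, V, w):
    """Evaluate g(w) = max_{x in P_F} w^T x via the greedy algorithm."""
    order = sorted(V, key=lambda e: -w[e])     # w(e1) >= ... >= w(en)
    g, prefix, prev = 0.0, [], F(frozenset())
    for e in order:
        prefix.append(e)
        cur = F(frozenset(prefix))
        g += w[e] * (cur - prev)               # x_w(e_i) = F(...) - F(...)
        prev = cur
    return g

# table from the example: F({})=0, F({a})=-1, F({b})=2, F({a,b})=0
table = {frozenset(): 0, frozenset("a"): -1,
         frozenset("b"): 2, frozenset("ab"): 0}
F = lambda A: table[frozenset(A)]
print(lovasz_extension(F, "ab", {"a": 0.3, "b": 0.7}))   # 0.7*2 + 0.3*(-2)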
33. Why is this useful?
Theorem [Lovász ’83]: g(w) attains its minimum over [0,1]^n at a corner!
If we can minimize g on [0,1]^n, we can minimize F…
(at the corners, g and F take the same values)
F(A) submodular ⇒ g(w) convex (and efficient to evaluate)
Does the converse also hold?
No: consider g(w1,w2,w3) = max(w1, w2+w3). It is convex, but the set function it induces at the corners is not submodular:
F({a,b}) – F({a}) = 0 < F({a,b,c}) – F({a,c}) = 1
34. Tutorial Overview
Examples and properties of submodular functions
Many problems submodular (mutual information, influence, …)
SFs closed under positive linear combinations;
not under min, max
Submodularity and convexity
Every SF induces a convex function with SAME minimum
Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions
Maximizing submodular functions
Extensions and research directions
34
37. Minimizing a submodular function
Want to solve A* = argmin_A F(A)
Need to solve min_w g(w) = min_w max_x wᵀx
s.t. w ∈ [0,1]^n, x ∈ P_F
Equivalently:
min_{c,w} c
s.t. c ≥ wᵀx for all x ∈ P_F
     w ∈ [0,1]^n
38. Ellipsoid algorithm
[Grötschel, Lovasz, Schrijver ’81]
min_{c,w} c
s.t. c ≥ wᵀx for all x ∈ P_F
     w ∈ [0,1]^n
Separation oracle: find the most violated constraint:
max_x wᵀx – c s.t. x ∈ P_F
Can solve separation using the greedy algorithm!!
Ellipsoid algorithm minimizes SFs in poly-time!
39. Minimizing submodular functions
Ellipsoid algorithm not very practical
Want combinatorial algorithm for minimization!
Theorem [Iwata (2001)]
There is a fully combinatorial, strongly polynomial algorithm for minimizing SFs that runs in time O(n^8 log^2 n)
Polynomial-time = practical???
40. A more practical alternative?
[Fujishige ’91, Fujishige et al ‘06]
Example: A F(A): ∅ → 0, {a} → -1, {b} → 2, {a,b} → 0
Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}
Minimum norm algorithm:
1. Find x* = argmin ||x||_2 s.t. x ∈ B_F   (here x* = [-1,1])
2. Return A* = {i : x*(i) < 0}   (here A* = {a})
Theorem [Fujishige ’91]: A* is an optimal solution!
Note: Can solve step 1 using Wolfe’s algorithm
Runtime finite but unknown!!
41. Empirical comparison
[Fujishige et al ’06]
[Plot: running time in seconds (log scale) vs. problem size 64–1024 (log scale) for cut functions from the DIMACS Challenge; lower is better.]
Minimum norm algorithm is orders of magnitude faster!
Our implementation can solve n = 10k in < 6 minutes!
42. Checking optimality (duality)
Example: A F(A): ∅ → 0, {a} → -1, {b} → 2, {a,b} → 0; base polytope B_F = P_F ∩ {x : x(V) = F(V)}
Theorem [Edmonds ’70]
min_A F(A) = max_x {x⁻(V) : x ∈ B_F}, where x⁻(s) = min{x(s), 0}
Testing how close A’ is to min_A F(A):
Run the greedy algorithm for w = w^{A’} to get x_w
Then F(A’) ≥ min_A F(A) ≥ x_w⁻(V)
Example: A = {a}, F(A) = -1; w = [1,0], x_w = [-1,1], x_w⁻ = [-1,0], x_w⁻(V) = -1 ⇒ A optimal!
43. Overview minimization
Minimizing general submodular functions
Can minimize in polytime using the ellipsoid method
Combinatorial, strongly polynomial algorithm: O(n^8)
Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
Applications to Machine Learning
43
44. What if we have special structure?
Worst-case complexity of the best known algorithm: O(n^8 log^2 n)
Can we do better for special cases?
Example (again): Given RVs X1,…,Xn,
F(A) = I(X_A; X_{V\A}) = I(X_{V\A}; X_A) = F(V\A)
Functions F with F(A) = F(V\A) for all A are called symmetric
44
45. Another example: Cut functions
V = {a,b,c,d,e,f,g,h}; weighted graph over V (edge weights 1–3)
F(A) = ∑ {w_{s,t} : s ∈ A, t ∈ V\A}
Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2
The cut function is symmetric and submodular!
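A sketch of a cut function in Python (the weighted graph below is a small made-up example, since the slide's drawing did not survive extraction):

edges = {("a", "b"): 2, ("a", "c"): 2, ("b", "d"): 2,
         ("c", "d"): 1, ("c", "e"): 3, ("d", "f"): 3}

def F_cut(A):
    """Total weight of edges crossing the cut (A, V \\ A)."""
    A = set(A)
    return sum(w for (s, t), w in edges.items() if (s in A) != (t in A))

V = {v for e in edges for v in e}
print(F_cut({"a"}), F_cut({"a", "b"}))
assert F_cut({"a"}) == F_cut(V - {"a"})     # symmetry: F(A) = F(V \ A)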
46. Minimizing symmetric functions
For any A, submodularity implies
2 F(A) = F(A) + F(V\A)
       ≥ F(A ∩ (V\A)) + F(A ∪ (V\A))
       = F(∅) + F(V) = 2 F(∅) = 0
Hence, any symmetric SF attains its minimum at ∅
In practice, we want a nontrivial partition of V into A and V\A, i.e., require that A is neither ∅ nor V
Want A* = argmin F(A) s.t. 0 < |A| < n
There is an efficient algorithm for doing that!
47. Queyranne’s algorithm (overview)
[Queyranne ’98]
Theorem: There is a fully combinatorial, strongly polynomial algorithm for solving
A* = argmin_A F(A) s.t. 0 < |A| < n
for symmetric submodular functions F
Runs in time O(n^3) [instead of O(n^8)…]
Note: also works for “posimodular” functions:
F posimodular ⇔ for all A,B ⊆ V: F(A) + F(B) ≥ F(A\B) + F(B\A)
48. Gomory Hu trees
(Example: weighted graph G on {a,…,h} and its Gomory-Hu tree T.)
A tree T is called a Gomory-Hu (GH) tree for an SF F if for any s, t ∈ V:
min {F(A) : s ∈ A and t ∉ A} = min {w_{i,j} : (i,j) is an edge on the s-t path in T}
(“min s-t cut in T = min s-t cut in G”)
Theorem [Queyranne ‘93]: GH-trees exist for any symmetric SF F!
(But they are expensive to find in general.)
49. Pendent pairs
For a function F on V and s,t ∈ V: (s,t) is a pendent pair if
{s} ∈ argmin_A F(A) s.t. s ∈ A, t ∉ A
Pendent pairs always exist:
In a Gomory-Hu tree T, take any leaf s and its neighbor t; then (s,t) is pendent!
E.g., (a,c), (b,c), (f,e), …
Theorem [Queyranne ’95]: Can find pendent pairs in O(n^2)
(without needing the GH-tree!)
50. Why are pendent pairs useful?
Key idea: Let (s,t) be pendent and A* = argmin F(A). Then EITHER
s and t are separated by A*, e.g., s ∈ A*, t ∉ A*. But then A* = {s}!!
OR
s and t are not separated by A*. Then we can merge s and t…
51. Merging
Suppose F is a symmetric SF on V, and we want to merge the pendent pair (s,t)
Key idea: “If we pick s, we get t for free”
V’ = V\{t}
F’(A) = F(A ∪ {t}) if s ∈ A, and F’(A) = F(A) if s ∉ A
Lemma: F’ is still symmetric and submodular!
(Example: merging (a,c) in the cut-function graph combines the two nodes and adds up their edge weights.)
52. Queyranne’s algorithm
Input: symmetric SF F on V, |V| = n
Output: A* = argmin F(A) s.t. 0 < |A| < n
Initialize F’ ← F and V’ ← V
For i = 1:n-1
  (s,t) ← pendentPair(F’,V’)
  A_i = {s}
  (F’,V’) ← merge(F’,V’,s,t)
Return argmin_i F(A_i)
Running time: O(n^3) function evaluations
53. Note: Finding pendent pairs
Initialize v1 ← x (x is an arbitrary element of V)
For i = 1 to n-1 do
  W_i ← {v1,…,v_i}
  v_{i+1} ← argmin_v F(W_i ∪ {v}) - F({v}) s.t. v ∈ V\W_i
Return pendent pair (v_{n-1}, v_n)
Requires O(n^2) evaluations of F (a Python sketch of the full algorithm follows below)
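A compact Python sketch combining the last two slides (pendent pairs via the ordering above, merging via super-nodes represented as frozensets); it assumes F is a symmetric submodular oracle and uses a made-up cut function as a test case:

def pendent_pair(F, groups):
    """Return (s, t): the last two super-nodes of the ordering; {s} is the candidate cut."""
    order, rest = [groups[0]], list(groups[1:])
    while rest:
        W = frozenset().union(*order)
        v = min(rest, key=lambda g: F(W | g) - F(g))   # argmin F(W_i u {v}) - F({v})
        order.append(v)
        rest.remove(v)
    return order[-1], order[-2]

def queyranne(F, V):
    """argmin F(A) s.t. 0 < |A| < n, for symmetric submodular F."""
    groups = [frozenset([v]) for v in V]
    best, best_val = None, float("inf")
    while len(groups) > 1:
        s, t = pendent_pair(F, groups)
        if F(s) < best_val:                            # candidate A_i = s
            best, best_val = s, F(s)
        groups = [g for g in groups if g not in (s, t)] + [s | t]   # merge s and t
    return best

# made-up symmetric SF: cut function of a tiny weighted graph
edges = {("a", "b"): 3, ("b", "c"): 1, ("c", "d"): 3, ("a", "d"): 1}
F = lambda A: sum(w for (u, v), w in edges.items() if (u in A) != (v in A))
print(queyranne(F, "abcd"))   # a side of the lightest nontrivial cut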
54. Overview minimization
Minimizing general submodular functions
Can minimize in polytime using the ellipsoid method
Combinatorial, strongly polynomial algorithm O(n8)
Practical alternative: Minimum norm algorithm?
Minimizing symmetric submodular functions
Many useful submodular functions are symmetric
Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
54
55. Application: Clustering
[Narasimhan, Jojic, Bilmes NIPS ’05]
Group data points V into “homogeneous clusters”
Find a partition V = A1 ∪ … ∪ Ak that minimizes
F(A1,…,Ak) = ∑_i E(A_i), where E(A_i) is the “inhomogeneity of A_i”
Examples for E(A): Entropy H(A); cut function
Special case: k = 2. Then F(A) = E(A) + E(V\A) is symmetric!
If E is submodular, we can use Queyranne’s algorithm!
56. What if we want k>2 clusters?
[Zhao et al ’05, Narasimhan et al ‘05]
Greedy splitting algorithm (see the sketch after this slide):
Start with partition P = {V}
For i = 1 to k-1
  For each member C_j ∈ P do
    split cluster C_j: A* = argmin E(A) + E(C_j\A) s.t. 0 < |A| < |C_j|
    P_j ← P \ {C_j} ∪ {A, C_j\A}   (partition obtained by splitting the j-th cluster)
  P ← argmin_j F(P_j)
Theorem: F(P) ≤ (2 - 2/k) F(P_opt)
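A sketch of greedy splitting in Python; for brevity the inner split is brute force over subsets (fine only for tiny clusters), where the tutorial uses Queyranne's algorithm, and the inhomogeneity E below is a made-up toy:

from itertools import combinations

def best_split(E, C):
    """Brute-force min of E(A) + E(C\\A) over nontrivial A (stand-in for Queyranne)."""
    C = frozenset(C)
    best = None
    for r in range(1, len(C)):
        for A in map(frozenset, combinations(C, r)):
            val = E(A) + E(C - A)
            if best is None or val < best[0]:
                best = (val, A, C - A)
    return best

def greedy_splitting(E, V, k):
    P = [frozenset(V)]
    for _ in range(k - 1):
        candidates = []
        for j, C in enumerate(P):
            if len(C) < 2:
                continue
            _, A, B = best_split(E, C)
            newP = P[:j] + P[j + 1:] + [A, B]          # split the j-th cluster
            candidates.append((sum(E(c) for c in newP), newP))
        P = min(candidates, key=lambda t: t[0])[1]     # keep the best split
    return P

# toy inhomogeneity: squared spread of 1-D points in a cluster
points = {"a": 0.0, "b": 0.1, "c": 5.0, "d": 5.2, "e": 9.9}
E = lambda C: (max(points[i] for i in C) - min(points[i] for i in C)) ** 2
print(greedy_splitting(E, points, k=3))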
57. Example: Clustering species
[Narasimhan et al ‘05]
Common genetic information = #of common substrings:
Can easily extend to sets of species
57
58. Example: Clustering species
[Narasimhan et al ‘05]
The common genetic information ICG
does not require alignment
captures genetic similarity
is smallest for maximally evolutionarily diverged species
is a symmetric submodular function!
Greedy splitting algorithm yields phylogenetic tree!
58
59. Example: SNPs
[Narasimhan et al ‘05]
Study human genetic variation
(for personalized medicine, …)
Most human variation due to point mutations that occur
once in human history at that base location:
Single Nucleotide Polymorphisms (SNPs)
Cataloging all variation too expensive
($10K-$100K per individual!!)
59
60. SNPs in the ACE gene
[Narasimhan et al ‘05]
Rows: Individuals. Columns: SNPs.
Which columns should we pick to reconstruct the rest?
Can find near-optimal clustering (Queyranne’s algorithm)
60
61. Reconstruction accuracy
[Narasimhan et al ‘05]
[Plot: prediction accuracy vs. number of clusters, comparing the submodular clustering with clustering based on entropy, pairwise correlation, and PCA.]
62. Example: Speaker segmentation
[Reyes-Gomez, Jojic ‘07]
Mixed waveforms from two speakers (e.g., “Alice”, “Fiona”); partition the spectrogram (time-frequency regions V) into regions, one per speaker
E(A) = -log p(X_A): likelihood of “region” A
F(A) = E(A) + E(V\A) is symmetric & posimodular, so it can be minimized using Queyranne’s algorithm
64. Example: Image denoising
Pairwise Markov Random Field
Grid of noisy pixels X_i and “true” pixels Y_i
P(x1,…,xn,y1,…,yn) = ∏_{i,j} ψ_{i,j}(y_i,y_j) ∏_i φ_i(x_i,y_i)
Want argmax_y P(y | x) = argmax_y log P(x,y)
  = argmin_y ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i), where E_{i,j}(y_i,y_j) = -log ψ_{i,j}(y_i,y_j)
When is this MAP inference efficiently solvable (in high treewidth graphical models)?
65. MAP inference in Markov Random Fields
[Kolmogorov et al, PAMI ’04, see also: Hammer, Ops Res ‘65]
Energy E(y) = ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i)
Suppose the y_i are binary; define
F(A) = E(y^A) where y^A_i = 1 iff i ∈ A
Then min_y E(y) = min_A F(A)
Theorem
The MAP inference problem is solvable by graph cuts ⇔
For all i,j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0),
i.e., each E_{i,j} is submodular
“Efficient if we prefer that neighboring pixels have the same color”
66. Constrained minimization
Have seen: if F is submodular on V, we can solve
A* = argmin F(A) s.t. A ⊆ V
What about
A* = argmin F(A) s.t. A ⊆ V and |A| ≤ k
E.g., clustering with a minimum # of points per cluster, …
In general, not much is known about constrained minimization
However, we can do
A* = argmin F(A) s.t. 0 < |A| < n
A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan ‘95]
A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige ’91]
67. Overview minimization
Minimizing general submodular functions
Can minimize in polytime using the ellipsoid method
Combinatorial, strongly polynomial algorithm O(n8)
Practical alternative: Minimum norm algorithm?
Minimizing symmetric submodular functions
Many useful submodular functions are symmetric
Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
Clustering [Narasimhan et al’ 05]
Speaker segmentation [Reyes-Gomez & Jojic ’07]
MAP inference [Kolmogorov et al ’04]
67
68. Tutorial Overview
Examples and properties of submodular functions
Many problems submodular (mutual information, influence, …)
SFs closed under positive linear combinations; not under min, max
Submodularity and convexity
Every SF induces a convex function with SAME minimum
Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions
Minimization possible in polynomial time (but O(n^8)…)
Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
Useful for clustering, MAP inference, structure learning, …
Maximizing submodular functions
Extensions and research directions
68
70. Maximizing submodular functions
Minimizing convex functions: polynomial time solvable!
Minimizing submodular functions: polynomial time solvable!
Maximizing convex functions: NP hard!
Maximizing submodular functions: NP hard! But we can get approximation guarantees.
70
71. Maximizing influence
[Kempe, Kleinberg, Tardos KDD ’03]
(Same social network over Alice, Bob, Charlie, Dorothy, Eric, Fiona with influence probabilities.)
F(A) = expected # of people influenced when targeting A
F is monotonic: if A ⊆ B then F(A) ≤ F(B)
Hence V = argmax_A F(A)
More interesting: argmax_A F(A) – Cost(A)
72. Maximizing non-monotonic functions
Suppose we want, for a non-monotonic F,
A* = argmax F(A) s.t. A ⊆ V
(the maximum can be attained at an intermediate value of |A|)
Example:
F(A) = U(A) – C(A), where U(A) is a submodular utility and C(A) is a supermodular cost function
E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI ’08]
In general: NP hard. Moreover:
If F(A) can take negative values, it is as hard to approximate as maximum independent set
(which is NP-hard to approximate within n^(1-ε))
73. Maximizing positive submodular functions
[Feige, Mirrokni, Vondrak FOCS ’07]
Theorem
There is an efficient randomized local search procedure that, given a nonnegative submodular function F with F(∅) = 0, returns a set A_LS such that
F(A_LS) ≥ (2/5) max_A F(A)
Picking a random set gives a 1/4 approximation (1/2 approximation if F is symmetric!)
We cannot get better than a 3/4 approximation unless P = NP
74. Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize:
Option 1 (“Scalarization”): max_A F(A) – C(A) s.t. A ⊆ V
  Can get a 2/5 approximation… but only if F(A) – C(A) ≥ 0 for all A ⊆ V; positiveness is a strong requirement
Option 2 (“Constrained maximization”): max_A F(A) s.t. C(A) ≤ B
  coming up…
76. Monotonicity
A set function is called monotonic if
A ⊆ B ⊆ V ⇒ F(A) ≤ F(B)
Examples:
Influence in social networks [Kempe et al KDD ’03]
For discrete RVs, entropy F(A) = H(X_A) is monotonic:
  suppose B = A ∪ C; then F(B) = H(X_A, X_C) = H(X_A) + H(X_C | X_A) ≥ H(X_A) = F(A)
Information gain: F(A) = H(Y) - H(Y | X_A)
Set cover
Matroid rank functions (dimension of vector spaces, …)
…
76
77. Subset selection
Given: finite set V, monotonic submodular function F with F(∅) = 0
Want: A* ⊆ V such that A* = argmax_{|A| ≤ k} F(A)
NP-hard!
78. Exact maximization of monotonic submodular functions
1) Mixed integer programming [Nemhauser et al ’81]
max η
s.t. η ≤ F(B) + ∑_{s ∈ V\B} α_s δ_s(B) for all B ⊆ V
     ∑_s α_s ≤ k
     α_s ∈ {0,1}
where δ_s(B) = F(B ∪ {s}) – F(B)
Solved using constraint generation
2) Branch-and-bound: “data-correcting algorithm” [Goldengorin et al ’99]
Both algorithms are worst-case exponential!
79. Approximate maximization
Given: finite set V, monotonic submodular function F(A)
Want: A* ⊆ V such that A* = argmax_{|A| ≤ k} F(A)
NP-hard!
Greedy algorithm:
Start with A_0 = ∅
For i = 1 to k
  s_i := argmax_s F(A_{i-1} ∪ {s}) - F(A_{i-1})
  A_i := A_{i-1} ∪ {s_i}
80. Performance of greedy algorithm
Theorem [Nemhauser et al ‘78]
Given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns A_greedy with
F(A_greedy) ≥ (1 - 1/e) max_{|A| ≤ k} F(A)   (~63%)
Sidenote: the greedy algorithm gives a 1/2 approximation for maximization over any matroid constraint! [Fisher et al ’78]
81. An “elementary” counterexample
X1, X2 ~ Bernoulli(0.5), Y = X1 XOR X2
Let F(A) = IG(X_A; Y) = H(Y) – H(Y | X_A)
Y | X1 and Y | X2 ~ Bernoulli(0.5) (entropy 1), but Y | X1,X2 is deterministic! (entropy 0)
Hence F({1,2}) – F({1}) = 1, but F({2}) – F(∅) = 0: diminishing returns fails, so information gain is not submodular in general
F(A) is submodular under some conditions! (later)
82. Example: Submodularity of info-gain
Y1,…,Ym, X1,…,Xn discrete RVs
F(A) = IG(Y; X_A) = H(Y) - H(Y | X_A)
F(A) is always monotonic
However, NOT always submodular
Theorem [Krause & Guestrin UAI ’05]
If the X_i are all conditionally independent given Y, then F(A) is submodular!
Hence, the greedy algorithm works!
In fact, NO algorithm can do better than a (1-1/e) approximation!
83. Building a Sensing Chair
[Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST ‘07]
People sit a lot
Activity recognition in assistive technologies
Seating pressure as a user interface
Equipped with 1 sensor per cm^2! Costs $16,000!
82% accuracy on 10 postures (lean left, lean forward, slouch, …)! [Tan et al]
Can we get similar accuracy with fewer, cheaper sensors?
84. How to place sensors on a chair?
Sensor readings at locations V as random variables
Predict posture Y using a probabilistic model P(Y,V)
Pick sensor locations A* ⊆ V to minimize the entropy H(Y | X_A)
Placed sensors, did a user study:
          Accuracy   Cost
Before    82%        $16,000
After     79%        $100
Similar accuracy at <1% of the cost!
85. Variance reduction
(a.k.a. Orthogonal matching pursuit, Forward Regression)
Let Y = ∑_i α_i X_i + ε, with (X1,…,Xn,ε) ~ N(·; μ,Σ)
Want to pick a subset X_A to predict Y
Var(Y | X_A = x_A): conditional variance of Y given X_A = x_A
Expected variance: Var(Y | X_A) = ∫ p(x_A) Var(Y | X_A = x_A) dx_A
Variance reduction: F_V(A) = Var(Y) – Var(Y | X_A)
F_V(A) is always monotonic
Theorem [Das & Kempe, STOC ’08]: F_V(A) is submodular* (*under some conditions on Σ)
⇒ orthogonal matching pursuit is near optimal!
[see other analyses by Tropp, Donoho et al., and Temlyakov]
86. Batch mode active learning [Hoi et al, ICML’06]
Which data points should we label to minimize error?
Want a batch A of k points to show an expert for labeling
F(A) selects examples that are
  uncertain [σ^2(s) = π(s) (1-π(s)) is large]
  diverse (points in A are as different as possible)
  relevant (as close to V\A as possible, s^T s’ large)
F(A) is submodular and monotonic!
[approximation to the improvement in Fisher information]
87. Results about Active Learning
[Hoi et al, ICML’06]
Batch mode Active Learning performs better than
Picking k points at random
Picking k points of highest entropy
87
88. Monitoring water networks
[Krause et al, J Wat Res Mgt 2008]
Contamination of drinking water could affect millions of people
Simulator from EPA; Hach sensors (~$14K each)
Place sensors to detect contaminations
“Battle of the Water Sensor Networks” competition
Where should we place sensors to quickly detect contamination?
89. Model-based sensing
Utility of placing sensors is based on a model of the world
For water networks: water flow simulator from EPA
F(A) = expected impact reduction from placing sensors at A
The model predicts low-, medium-, and high-impact contamination locations over the set V of all network junctions; a sensor reduces impact through early detection
(E.g., one placement gives high impact reduction F(A) = 0.9, another low impact reduction F(A) = 0.01)
Theorem [Krause et al., J Wat Res Mgt ’08]:
Impact reduction F(A) in water networks is submodular!
90. Battle of the Water Sensor Networks Competition
Real metropolitan area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives:
Detection time, affected population, …
Place sensors that detect well “on average”
90
91. Bounds on optimal solution
[Krause et al., J Wat Res Mgt ’08]
[Plot: population protected F(A) (higher is better) vs. number of sensors placed (0–20) on water networks data; greedy solution vs. offline (Nemhauser) bound.]
The (1-1/e) bound is quite loose… can we get better bounds?
92. Data dependent bounds
[Minoux ’78]
Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k,
and let A* = {s1,…,sk} be an optimal solution. Then
F(A*) ≤ F(A ∪ A*)
      = F(A) + ∑_i [F(A ∪ {s1,…,si}) - F(A ∪ {s1,…,s_{i-1}})]
      ≤ F(A) + ∑_i (F(A ∪ {s_i}) - F(A))
      = F(A) + ∑_i δ_{s_i}
For each s ∈ V\A, let δ_s = F(A ∪ {s}) - F(A) and order so that δ_(1) ≥ δ_(2) ≥ … ≥ δ_(n);
then F(A*) ≤ F(A) + ∑_{i=1}^k δ_(i): an online, data-dependent bound.
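A sketch of this data-dependent bound in Python (F and A are whatever oracle and candidate solution you have; the toy coverage utility below is only for the usage line):

def minoux_bound(F, V, A, k):
    """Upper bound on max_{|A| <= k} F(A), given any candidate set A."""
    A = frozenset(A)
    deltas = sorted((F(A | {s}) - F(A) for s in set(V) - A), reverse=True)
    return F(A) + sum(deltas[:k])

covers = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"w"}, 4: set()}
F = lambda A: len(set().union(*[covers[s] for s in A])) if A else 0
A_greedy = {1, 3}
print(F(A_greedy), "<= OPT <=", minoux_bound(F, covers, A_greedy, k=2))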
93. Bounds on optimal solution
[Krause et al., J Wat Res Mgt ’08]
[Plot: sensing quality F(A) (higher is better) vs. number of sensors placed (0–20) on water networks data; greedy solution, offline (Nemhauser) bound, and data-dependent bound.]
Submodularity gives data-dependent bounds on the performance of any algorithm
94. BWSN Competition results
[Ostfeld et al., J Wat Res Mgt 2008]
13 participants
Performance measured in 30 different criteria
Legend: G = genetic algorithm, H = other heuristic, D = domain knowledge, E = “exact” method (MIP)
[Bar chart: total score (higher is better) for the entrants; the rotated team-name labels did not survive extraction.]
24% better performance than the runner-up!
95. What was the trick?
Simulated all 3.6M contaminations: 2 weeks on 40 processors
152 GB data on disk, 16 GB in main memory (compressed)
Very accurate computation of F(A), but very slow evaluation of F(A): 30 hours for 20 sensors
[Plot: running time (minutes, lower is better) vs. number of sensors selected; exhaustive search (all subsets) would take 6 weeks for all 30 settings; naive greedy is still slow.]
Submodularity to the rescue…
96. Scaling up greedy algorithm
[Minoux ’78]
In round i+1, having picked A_i = {s1,…,si},
pick s_{i+1} = argmax_s F(A_i ∪ {s}) - F(A_i)
I.e., maximize the “marginal benefit” δ_s(A_i) = F(A_i ∪ {s}) - F(A_i)
Key observation: submodularity implies
i ≤ j ⇒ δ_s(A_i) ≥ δ_s(A_j)
Marginal benefits can never increase!
97. “Lazy” greedy algorithm
[Minoux ’78]
Lazy greedy algorithm (sketched in Python below):
First iteration as usual
Keep an ordered list of marginal benefits δ_s from the previous iteration
Re-evaluate δ_s only for the top element
If δ_s stays on top, use it; otherwise re-sort
Note: Very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. ’07]
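A sketch of lazy greedy in Python using a max-heap of (possibly stale) marginal gains (heapq is a min-heap, hence the negated values); the toy coverage utility is again just for the usage line:

import heapq

def lazy_greedy(F, V, k):
    A, fA = frozenset(), F(frozenset())
    heap = [(-(F(frozenset({s})) - fA), s) for s in V]   # stale gains
    heapq.heapify(heap)
    for _ in range(min(k, len(V))):
        while True:
            neg_gain, s = heapq.heappop(heap)
            fresh = F(A | {s}) - fA                      # re-evaluate top only
            if not heap or fresh >= -heap[0][0]:
                A, fA = A | {s}, fA + fresh              # still on top: take it
                break
            heapq.heappush(heap, (-fresh, s))            # stale: re-insert, retry
    return A

covers = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"w"}}
F = lambda A: len(set().union(*[covers[s] for s in A])) if A else 0
print(lazy_greedy(F, covers, k=2))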
98. Result of lazy evaluation
Same setting as before: 3.6M contaminations, 152 GB on disk; naive greedy needs 30 hours for 20 sensors
[Plot: running time (minutes, lower is better) vs. number of sensors selected; exhaustive search (all subsets), naive greedy, and fast greedy using “lazy evaluations”.]
Using lazy evaluations: 1 hour for 20 sensors
Done after 2 days!
99. What about worst-case?
[Krause et al., NIPS ’07]
Knowing the sensor locations, an adversary contaminates here!
A placement that detects well on “average-case” (accidental) contamination can still be poor against an adversary: two placements can have very different average-case impact but the same worst-case impact
Where should we place sensors to quickly detect contamination in the worst case?
100. Constrained maximization: Outline
(Constrained maximization: maximize a utility function F(A) over the selected set A, subject to selection cost C(A) ≤ budget B.)
Subset selection
Robust optimization
Complex constraints
101. Optimizing for the worst case
Separate utility function F_i for each contamination i
F_i(A) = impact reduction by sensors A for contamination i
(different placements are good for different contamination scenarios)
Want to solve A* = argmax_{|A| ≤ k} min_i F_i(A)
Each of the F_i is submodular
Unfortunately, min_i F_i is not submodular!
How can we solve this robust optimization problem?
102. How does the greedy algorithm do?
V = {a, b, c} (three candidate locations, shown as icons on the slide); can only buy k = 2
Set A     F1   F2   min_i F_i
{a}       1    0    0
{b}       0    2    0
{c}       ε    ε    ε
{a,c}     1    ε    ε
{b,c}     ε    2    ε
{a,b}     1    2    1
Greedy picks c first (the only single element with nonzero min); then it can choose only a or b, so the greedy score is ε, while the optimal score is 1 (set {a,b}).
Greedy does arbitrarily badly. Is there something better?
Theorem [NIPS ’07]: The problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P=NP
Hence we can’t find any approximation algorithm. Or can we?
103. Alternative formulation
If somebody told us the optimal value c,
could we recover the optimal solution A*?
Need to find the smallest A with min_i F_i(A) ≥ c
Is this any easier?
Yes, if we relax the constraint |A| ≤ k
104. Solving the alternative problem
Trick: For each F_i and c, define the truncation F’_{i,c}(A) = min{F_i(A), c}
It remains submodular!
Note that min_i F_i(A) ≥ c ⇔ (1/m) ∑_i F’_{i,c}(A) ≥ c
Problem 1 (last slide): max_{|A| ≤ k} min_i F_i(A): non-submodular, don’t know how to solve
Problem 2: find a small A with (1/m) ∑_i F’_{i,c}(A) ≥ c: submodular, but appears as a constraint
Both have the same optimal solutions, so solving one solves the other
105. Maximization vs. coverage
Previously: wanted
A* = argmax F(A) s.t. |A| ≤ k
Now we need to solve:
A* = argmin |A| s.t. F(A) ≥ Q
Greedy algorithm (see the sketch below):
Start with A := ∅
While F(A) < Q and |A| < n
  s* := argmax_s F(A ∪ {s})
  A := A ∪ {s*}
(For the bound, assume F is integral; if not, just round it.)
Theorem [Wolsey et al]: Greedy returns A_greedy with |A_greedy| ≤ (1 + log max_s F({s})) |A*|
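A sketch of this coverage-style greedy in Python (stop once the quota Q is reached; toy utility again):

def greedy_cover(F, V, Q):
    """Greedy for argmin |A| s.t. F(A) >= Q."""
    A = set()
    while F(A) < Q and len(A) < len(V):
        A.add(max(set(V) - A, key=lambda s: F(A | {s})))
    return A

covers = {1: {"x", "y"}, 2: {"y", "z"}, 3: {"w"}}
F = lambda A: len(set().union(*[covers[s] for s in A])) if A else 0
print(greedy_cover(F, covers, Q=3))   # e.g. {1, 2}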
106. Solving the alternative problem
Trick: For each F_i and c, define the truncation F’_{i,c}(A) = min{F_i(A), c}
Problem 1 (last slide): non-submodular, don’t know how to solve
Problem 2 (average of truncations): submodular, can use the greedy algorithm!
107. Back to our example
Guess c = 1 (same example as before):
Set A     F1   F2   min_i F_i   F’_avg,1
{a}       1    0    0           1/2
{b}       0    2    0           1/2
{c}       ε    ε    ε           ε
{a,c}     1    ε    ε           (1+ε)/2
{b,c}     ε    2    ε           (1+ε)/2
{a,b}     1    2    1           1
Greedy on F’_avg,1 first picks a, then picks b: the optimal solution!
How do we find c? Do binary search!
108. SATURATE Algorithm
[Krause et al, NIPS ‘07]
Given: set V, integer k and monotonic SFs F1,…,Fm
Initialize c_min = 0, c_max = min_i F_i(V)
Do binary search: c = (c_min + c_max)/2   (sketch in Python below)
  Greedily find A_G such that F’_avg,c(A_G) = c
  If |A_G| ≤ α k: increase c_min
  If |A_G| > α k: decrease c_max
until convergence
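A compact Python sketch of SATURATE along the lines above (the F_i are monotone submodular oracles; α and the number of binary-search steps are parameters here; the two toy objectives at the end are made up):

def saturate(Fs, V, k, alpha=1.0, iters=30):
    m = len(Fs)
    F_avg = lambda A, c: sum(min(Fi(A), c) for Fi in Fs) / m   # truncated average
    def greedy_cover(c):
        A = set()
        while F_avg(A, c) < c and len(A) < len(V):
            A.add(max(set(V) - A, key=lambda s: F_avg(A | {s}, c)))
        return A
    c_lo, c_hi, best = 0.0, min(Fi(set(V)) for Fi in Fs), set()
    for _ in range(iters):                     # binary search over c
        c = (c_lo + c_hi) / 2
        A = greedy_cover(c)
        if len(A) <= alpha * k and F_avg(A, c) >= c:
            c_lo, best = c, A                  # feasible: try a larger c
        else:
            c_hi = c                           # needs too many sensors: lower c
    return best

F1 = lambda A: float(len(A & {1, 2}))          # toy objectives
F2 = lambda A: float(len(A & {2, 3}))
print(saturate([F1, F2], {1, 2, 3}, k=1))      # element 2 covers both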
109. Theoretical guarantees
[Krause et al, NIPS ‘07]
Theorem: The problem max_{|A| ≤ k} min_i F_i(A) does not admit any approximation unless P=NP
Theorem: SATURATE finds a solution A_S such that
min_i F_i(A_S) ≥ OPT_k and |A_S| ≤ α k,
where OPT_k = max_{|A| ≤ k} min_i F_i(A) and α = 1 + log max_s ∑_i F_i({s})
Theorem: If there were a polytime algorithm with a better factor β < α, then NP ⊆ DTIME(n^{log log n})
110. Example: Lake monitoring
Monitor pH values along a transect using a robotic sensor
Use a probabilistic model (Gaussian processes) to estimate the prediction error Var(s | A) at unobserved positions s along the transect, given observations A of the true (hidden) pH values
Where should we sense to minimize our maximum error?
Variance reduction is (often) submodular [Das & Kempe ’08], so this is a robust submodular optimization problem!
111. Comparison with state of the art
Algorithm used in geostatistics: Simulated Annealing
[Sacks & Schiller ’88, van Groeningen & Stein ’98, Wiens ’05,…]
7 parameters that need to be fine-tuned
[Plots: maximum marginal variance (lower is better) vs. number of sensors, on environmental monitoring and precipitation data, comparing Greedy, Simulated Annealing, and SATURATE.]
SATURATE is competitive & 10x faster, with no parameters to tune!
112. Results on water networks
[Plot: maximum detection time (minutes, lower is better) vs. number of sensors on water networks data; Greedy shows no decrease until all contaminations are detected; SATURATE beats Simulated Annealing.]
60% lower worst-case detection time!
113. Worst- vs. average case
Given: set V, submodular functions F1,…,Fm
Average-case score F_ac(A) = (1/m) ∑_i F_i(A): too optimistic?
Worst-case score F_wc(A) = min_i F_i(A): very pessimistic!
Want to optimize both the average- and the worst-case score!
Can modify SATURATE to solve this problem:
Want F_ac(A) ≥ c_ac and F_wc(A) ≥ c_wc
Truncate: min{F_ac(A), c_ac} + min{F_wc(A), c_wc} ≥ c_ac + c_wc
114. Worst- vs. average case
[Plot: worst-case impact vs. average-case impact (both lower is better) on water networks data; the tradeoff curve traced by SATURATE runs between “only optimize for average case” and “only optimize for worst case”, with a knee in the tradeoff curve.]
Can find a good compromise between average- and worst-case score!
115. Constrained maximization: Outline
(Constrained maximization: maximize a utility function F(A) over the selected set A, subject to selection cost C(A) ≤ budget B.)
Subset selection
Robust optimization
Complex constraints
116. Other aspects: Complex constraints
max_A F(A) or max_A min_i F_i(A) subject to
So far: |A| ≤ k
In practice, more complex constraints:
Different costs: C(A) ≤ B
Locations need to be connected by paths [Chekuri & Pal, FOCS ’05] (e.g., lake monitoring)
Sensors need to communicate, i.e., form a routing tree [Singh et al, IJCAI ’07] (e.g., building monitoring)
117. Non-constant cost functions
For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …)
Cost of a set: C(A) = ∑_{s∈A} c(s) (a modular function!)
Want to solve A* = argmax F(A) s.t. C(A) ≤ B
Cost-benefit greedy algorithm (see the sketch below):
Start with A := ∅
While there is an s ∈ V\A with C(A ∪ {s}) ≤ B:
  pick s* among those s maximizing the benefit-cost ratio (F(A ∪ {s}) - F(A)) / c(s)
  A := A ∪ {s*}
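A sketch of the cost-benefit rule in Python (the ratio-based argmax was a figure on the slide, so treat it as an assumption; the guarantee two slides ahead takes the better of this and the plain unit-cost greedy):

def cost_benefit_greedy(F, costs, B):
    """Greedy for max F(A) s.t. sum of costs <= B, using benefit/cost ratios."""
    A, spent = set(), 0.0
    while True:
        afford = [s for s in costs if s not in A and spent + costs[s] <= B]
        if not afford:
            return A
        s_best = max(afford, key=lambda s: (F(A | {s}) - F(A)) / costs[s])
        A.add(s_best)
        spent += costs[s_best]

# unit-cost variant: same loop with the ratio replaced by the plain gain
# F(A | {s}) - F(A); the theorem two slides ahead uses max{F(A_CB), F(A_UC)}.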
118. Performance of cost-benefit greedy
Want max_A F(A) s.t. C(A) ≤ 1
Set A   F(A)   C(A)
{a}     2ε     ε
{b}     1      1
Cost-benefit greedy picks a (ratio 2 vs. 1); then it cannot afford b!
Cost-benefit greedy performs arbitrarily badly!
119. Cost-benefit optimization
[Wolsey ’82, Sviridenko ’04, Leskovec et al ’07]
Theorem [Leskovec et al. KDD ‘07]
Let A_CB be the cost-benefit greedy solution and A_UC the unit-cost greedy solution (i.e., ignoring costs). Then
max { F(A_CB), F(A_UC) } ≥ ½ (1 - 1/e) OPT
Can still compute online bounds and speed up using lazy evaluations
Note: Can also get a (1-1/e) approximation in time O(n^4) [Sviridenko ’04],
and slightly better than ½ (1-1/e) in O(n^2) [Wolsey ‘82]
120. Example: Cascades in the Blogosphere
[Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance ‘07]
Information cascades spread through the blogosphere over time; blogs far down the cascade learn about the story after us
Which blogs should we read to learn about big cascades early?
121. Water vs. Web
Placing sensors in water networks vs. selecting informative blogs
In both problems we are given
a graph with nodes (junctions / blogs) and edges (pipes / links),
and cascades spreading dynamically over the graph (contamination / citations)
Want to pick nodes to detect big cascades early
In both applications, the utility functions are submodular [generalizes Kempe et al, KDD ’03]
121
122. Performance on Blog selection
[Plots: (left) fraction of cascades captured (higher is better) vs. number of blogs on ~45k blogs, comparing greedy with in-links, all out-links, # posts, and random; (right) running time in seconds (lower is better) vs. number of blogs selected, comparing exhaustive search (all subsets), naive greedy, and fast greedy.]
Outperforms state-of-the-art heuristics
700x speedup using submodularity!
123. Cost of reading a blog
Naïve approach: just pick the 10 best blogs
Selects big, well-known blogs (Instapundit, etc.)
These contain many posts and take long to read!
[Plot: cascades captured vs. Cost(A) = number of posts per day, comparing cost/benefit analysis with ignoring cost.]
Cost-benefit optimization picks “summarizer” blogs!
124. Predicting the “hot” blogs
Want blogs that will be informative in the future
Split the data set: train on historic data, test on future data
[Plots: cascades captured vs. Cost(A) = number of posts per day for “cheating” (greedy on future, tested on future) vs. greedy on historic, tested on future; and # detections per month (Jan–May) for greedy vs. SATURATE.]
Greedy on historic data detects well on the training period but poorly afterwards: blog selection “overfits” the training data; poor generalization!
Why is that? Let’s see what goes wrong. Want blogs that continue to do well!
125. Robust optimization
The “overfit” blog selection A detects many cascades in the training months and few later.
Idea: define F_i(A) = detections in time interval i (e.g., F1(A)=.5, F2(A)=.8, F3(A)=.6, F4(A)=.01, F5(A)=.02)
and optimize the worst case over intervals using SATURATE.
The resulting “robust” blog selection A* detects well across all months (Jan–May).
Robust optimization acts as regularization!
126. Predicting the “hot” blogs
[Plot: cascades captured vs. Cost(A) = number of posts per day, all tested on future data: “cheating” (greedy on future), robust solution, and greedy on historic.]
50% better generalization!
127. Other aspects: Complex constraints
max_A F(A) or max_A min_i F_i(A) subject to
So far: |A| ≤ k
In practice, more complex constraints:
Different costs: C(A) ≤ B
Locations need to be connected by paths [Chekuri & Pal, FOCS ’05] (e.g., lake monitoring)
Sensors need to communicate, i.e., form a routing tree [Singh et al, IJCAI ’07] (e.g., building monitoring)
128. Naïve approach: Greedy-connect
Simple heuristic: greedily optimize the submodular utility function F(A), then add relay nodes to minimize the communication cost C(A)
Problem: the most informative placement may allow no communication at all, and connecting it with relays makes the communication cost very high; the second most informative placement communicates efficiently but is not very informative
Communication cost = expected # of trials (learned using Gaussian Processes)
Want to find the optimal tradeoff between information and communication cost
129. The pSPIEL Algorithm
[Krause, Guestrin, Gupta, Kleinberg IPSN 2006]
pSPIEL: Efficient nonmyopic algorithm
(padded Sensor Placements at Informative and cost-Effective Locations)
1. Decompose the sensing region into small, well-separated clusters
2. Solve the cardinality-constrained problem per cluster (greedy)
3. Combine the solutions using the k-MST algorithm
Editor’s notes
Get a p+1 approximation for C = intersection of p matroids; 1-1/e w.h.p. for any matroid using continuous greedy + pipage rounding
Random placement: 53%; Uniform: 73%; Optimized: 81%