Beyond Convexity –
Submodularity in Machine Learning

            Andreas Krause, Carlos Guestrin
                    Carnegie Mellon University

      International Conference on Machine Learning | July 5, 2008




                                                         Carnegie Mellon
Acknowledgements
Thanks for slides and material to Mukund Narasimhan,
Jure Leskovec and Manuel Reyes Gomez

MATLAB Toolbox and details for references available at

    http://www.submodularity.org


Algorithms implemented in the toolbox are marked [M]
Optimization in Machine Learning
Classify + from – by finding a separating hyperplane (parameters w).
[Figure: + and – points with two candidate hyperplanes w1, w2 and the max-margin choice w*]
Which one should we choose?

Define loss L(w) = “1/size of margin”
Solve for best vector w* = argmin_w L(w)

Key observation: Many problems in ML are convex!
[Figure: convex loss L(w) as a function of w]
→ no local minima!!
Feature selection
Given random variables Y, X1, … Xn
Want to predict Y from subset XA = (Xi1,…,Xik)
                            Y
                          “Sick”
                                                 Naïve Bayes
                 X1        X2         X3         Model
              “Fever”    “Rash”     “Male”




Want k most informative features:

            A* = argmax IG(XA; Y) s.t. |A| ≤ k

where IG(XA; Y) = H(Y) - H(Y | XA)
(the uncertainty about Y before knowing XA, minus the uncertainty after knowing XA)
Factoring distributions
Given random variables X1,…,Xn,
partition the variables V into sets A and V\A that are as independent as possible.
[Figure: V = {X1,…,X7} split into A = {X1,X3,X4,X6} and V\A = {X2,X5,X7}]

Formally: Want

    A* = argmin_A I(XA; X_{V\A}) s.t. 0 < |A| < n

where I(XA; XB) = H(XB) - H(XB | XA)

Fundamental building block in structure learning
[Narasimhan & Bilmes, UAI ’04]
Combinatorial problems in ML
Given a (finite) set V and a function F: 2^V → ℝ, want
    A* = argmin F(A) s.t. some constraints on A

Solving combinatorial problems:
  Mixed integer programming?
    Often difficult to scale to large problems
  Relaxations? (e.g., L1 regularization, etc.)
    Not clear when they work
  This talk:
    Fully combinatorial algorithms (spanning tree, matching, …)
    Exploit problem structure to get guarantees about the solution!
Example: Greedy algorithm for feature selection
Given: finite set V of features, utility function F(A) = IG(XA; Y)
Want: A* ⊆ V such that A* = argmax F(A) s.t. |A| ≤ k    (NP-hard!)

Greedy algorithm: [M]
  Start with A = ∅
  For i = 1 to k
    s* := argmax_s F(A ∪ {s})
    A := A ∪ {s*}

How well can this simple heuristic do?
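A minimal Python sketch of this greedy heuristic (our own code, not the MATLAB toolbox; the toy coverage-style utility stands in for information gain):

```python
def greedy(F, V, k):
    """Greedily pick k elements of V, maximizing the set function F each step."""
    A = set()
    for _ in range(k):
        # Add the element whose inclusion gives the largest F value.
        s_star = max((s for s in V if s not in A), key=lambda s: F(A | {s}))
        A.add(s_star)
    return A

# Toy stand-in utility: number of distinct "symptoms explained" by A.
explains = {"X1": {"fever"}, "X2": {"rash", "fever"}, "X3": {"male"}}
F = lambda A: len(set().union(*(explains[s] for s in A))) if A else 0
print(greedy(F, explains.keys(), 2))  # {'X2', 'X3'}
```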
Key property: Diminishing returns
Compare selection A = {} with selection B = {X2 “Rash”, X3 “Male”}:
adding the new feature X1 “Fever” to A will help a lot, but adding X1 to B doesn’t help much.

Theorem [Krause, Guestrin UAI ‘05]: Information gain F(A) in
Naïve Bayes models is submodular!

Submodularity: adding s to the small set A gives a large improvement;
adding s to the large set B ⊇ A gives a small improvement:

    For A ⊆ B, F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
Why is submodularity useful?

Theorem [Nemhauser et al ‘78]
The greedy maximization algorithm returns A_greedy with:
     F(A_greedy) ≥ (1 - 1/e) max_{|A| ≤ k} F(A)    (1 - 1/e ≈ 63%)

Greedy algorithm gives near-optimal solution!
More details and exact statement later
For info-gain: guarantees best possible unless P = NP!
[Krause, Guestrin UAI ’05]
Submodularity in Machine Learning
In this tutorial we will see that many ML problems reduce to
optimizing a submodular function F:
Minimization: A* = argmin F(A)
  Structure learning (A* = argmin I(XA; X_{V\A}))
  Clustering
  MAP inference in Markov Random Fields
  …
Maximization: A* = argmax F(A)
  Feature selection
  Active learning
  Ranking
  …
Tutorial Overview
   Examples and properties of submodular functions

   Submodularity and convexity

   Minimizing submodular functions

   Maximizing submodular functions

   Research directions, …

    LOTS of applications to Machine Learning!!
Submodularity
Properties and Examples
Set functions
Finite set V = {1,2,…,n}
Function F: 2^V → ℝ
Will always assume F(∅) = 0 (w.l.o.g.)
Assume a black box that can evaluate F for any input A
  Approximate (noisy) evaluation of F is ok (e.g., [37])
Example: F(A) = IG(XA; Y) = H(Y) – H(Y | XA)
               = ∑_{y, xA} P(xA, y) [log P(y | xA) – log P(y)]
For the two Naïve Bayes models above: F({X1,X2}) = 0.9, F({X2,X3}) = 0.5
Submodular set functions
Set function F on V is called submodular if
     For all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)

Equivalent diminishing returns characterization:
     For A ⊆ B, s ∉ B: F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)
(adding s to the small set A gives a large improvement; adding s to the large set B gives a small improvement)
Submodularity and supermodularity
Set function F on V is called submodular if
     1) For all A, B ⊆ V: F(A) + F(B) ≥ F(A ∪ B) + F(A ∩ B)
     2) Equivalently: for all A ⊆ B, s ∉ B, F(A ∪ {s}) – F(A) ≥ F(B ∪ {s}) – F(B)

F is called supermodular if –F is submodular
F is called modular if F is both sub- and supermodular
     for modular (“additive”) F, F(A) = ∑_{i∈A} w(i)
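Both characterizations quantify over exponentially many sets, so they can only be checked by brute force on small ground sets. A small hedged sketch of such a checker (our own helper, purely illustrative):

```python
from itertools import chain, combinations

def is_submodular(F, V):
    """Brute-force check of diminishing returns:
    for all A ⊆ B and s ∉ B, F(A ∪ {s}) - F(A) >= F(B ∪ {s}) - F(B)."""
    V = list(V)
    subsets = [frozenset(c) for c in chain.from_iterable(
        combinations(V, r) for r in range(len(V) + 1))]
    for B in subsets:
        for A in (S for S in subsets if S <= B):
            for s in (e for e in V if e not in B):
                if F(A | {s}) - F(A) < F(B | {s}) - F(B):
                    return False
    return True

# Modular ("additive") functions pass the check:
w = {"a": 1.0, "b": 2.0}
print(is_submodular(lambda A: sum(w[i] for i in A), w))  # True
```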
Example: Set cover
Place sensors in a building: want to cover the floorplan with discs.
[Figure: building floorplan with the set V of possible sensor locations; each node predicts values of positions within some radius]
For A ⊆ V: F(A) = “area covered by sensors placed at A”
Formally:
    W finite set, collection of n subsets S_i ⊆ W
    For A ⊆ V = {1,…,n} define F(A) = |∪_{i∈A} S_i|
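The formal definition translates directly into code (a sketch; the grid-cell representation of the discs is our own toy example):

```python
def coverage(sets):
    """Build the set-cover function F(A) = |union of S_i for i in A|."""
    def F(A):
        return len(set().union(*(sets[i] for i in A))) if A else 0
    return F

# Three overlapping "discs", each represented by the grid cells it covers:
S = [{1, 2, 3}, {3, 4}, {4, 5, 6}]
F = coverage(S)
print(F({0, 1}), F({0, 1, 2}))  # 4 6
```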
Set cover is submodular
[Figure: floorplan with discs S1, S2, S3, S4 and a new disc S’]
Consider A = {S1,S2} and B = {S1,S2,S3,S4}. The area newly covered by S’ on top of the small collection A is at least the area it adds on top of the larger collection B:

    F(A ∪ {S’}) – F(A) ≥ F(B ∪ {S’}) – F(B)
Example: Mutual information
Given random variables X1,…,Xn, let
F(A) = I(XA; X_{V\A}) = H(X_{V\A}) – H(X_{V\A} | XA)

Lemma: Mutual information F(A) is submodular

F(A ∪ {s}) – F(A) = H(Xs | XA) – H(Xs | X_{V\(A∪{s})})
The first term is nonincreasing in A (A ⊆ B ⇒ H(Xs|XA) ≥ H(Xs|XB));
the second term is nondecreasing in A.
Hence δ_s(A) = F(A ∪ {s}) – F(A) is monotonically nonincreasing
→ F submodular
Example: Influence in social networks
[Kempe, Kleinberg, Tardos KDD ’03]
[Figure: network over Alice, Bob, Charlie, Dorothy, Eric, Fiona; edges labeled with the probability of influencing, e.g., 0.2–0.5]

Who should get free cell phones?
V = {Alice, Bob, Charlie, Dorothy, Eric, Fiona}
F(A) = expected number of people influenced when targeting A
Influence in social networks is submodular
[Kempe, Kleinberg, Tardos KDD ’03]

Key idea: Flip coins c in advance → “live” edges

F_c(A) = people influenced under outcome c (a set cover function!)

F(A) = ∑_c P(c) F_c(A) is submodular as well!
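The “flip coins in advance” trick doubles as a Monte-Carlo estimator for F(A): sample live edges, then count reachable nodes. A sketch under the independent cascade model (the graph and numbers are our toy example):

```python
import random

def influence(graph, A, samples=2000, seed=0):
    """Estimate F(A) = expected # influenced, averaging F_c(A) over coin flips c.

    graph: dict u -> {v: probability that u influences v}.
    Each sample fixes the "live" edges up front; F_c(A) is then just
    reachability from A (a set cover function).
    """
    rng = random.Random(seed)
    total = 0
    for _ in range(samples):
        live = {u: [v for v, p in nbrs.items() if rng.random() < p]
                for u, nbrs in graph.items()}
        reached, frontier = set(A), list(A)
        while frontier:
            for v in live.get(frontier.pop(), []):
                if v not in reached:
                    reached.add(v)
                    frontier.append(v)
        total += len(reached)
    return total / samples

g = {"Alice": {"Bob": 0.3, "Dorothy": 0.5},
     "Dorothy": {"Eric": 0.2}, "Bob": {"Charlie": 0.5}}
print(influence(g, {"Alice"}))  # ≈ 2.05 for this toy graph
```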
Closedness properties
F1,…,Fm submodular functions on V and λ1,…,λm > 0
Then: F(A) = ∑_i λ_i F_i(A) is submodular!

Submodularity is closed under nonnegative linear combinations!

Extremely useful fact!!
  F_θ(A) submodular ⇒ ∑_θ P(θ) F_θ(A) submodular!
  Multicriterion optimization:
    F1,…,Fm submodular, λ_i ≥ 0 ⇒ ∑_i λ_i F_i(A) submodular
Submodularity and Concavity
Suppose g: ℕ → ℝ and F(A) = g(|A|)
Then F(A) is submodular if and only if g is concave!
  E.g., g could say “buying in bulk is cheaper”
[Figure: concave g(|A|) as a function of |A|]
Maximum of submodular functions
Suppose F1(A) and F2(A) are submodular.
Is F(A) = max(F1(A), F2(A)) submodular?
[Figure: pointwise maximum of two curves F1(A) and F2(A) over |A|]

max(F1, F2) is not submodular in general!
Minimum of submodular functions
Well, maybe F(A) = min(F1(A), F2(A)) instead?

    A        F1(A)   F2(A)   F(A)
    ∅        0       0       0
    {a}      1       0       0
    {b}      0       1       0
    {a,b}    1       1       1

Here F({b}) – F(∅) = 0 < F({a,b}) – F({a}) = 1,
so min(F1, F2) is not submodular in general!

But stay tuned – we’ll address min_i F_i later!
Duality
For F submodular on V, let G(A) = F(V) – F(V\A)
G is supermodular and called dual to F
Details about properties in [Fujishige ’91]
[Figure: F(A) and its dual G(A) as functions of |A|]
Tutorial Overview
Examples and properties of submodular functions
  Many problems submodular (mutual information, influence, …)
  SFs closed under positive linear combinations;
  not under min, max
Submodularity and convexity

Minimizing submodular functions

Maximizing submodular functions

Extensions and research directions
Submodularity and Convexity
Submodularity and convexity
For V = {1,…,n} and A ⊆ V, let
    w^A = (w1^A,…,wn^A) with w_i^A = 1 if i ∈ A, 0 otherwise

Key result [Lovasz ’83]: Every submodular function
F induces a function g on ℝ^n_+, such that
  F(A) = g(w^A) for all A ⊆ V
  g(w) is convex
  min_A F(A) = min_w g(w) s.t. w ∈ [0,1]^n

Let’s see how one can define g(w)
The submodular polyhedron P_F
P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑_{i∈A} x_i

Example: V = {a,b} with
    A        F(A)
    ∅        0
    {a}      -1
    {b}      2
    {a,b}    0
[Figure: P_F in the (x_a, x_b) plane, cut out by x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b})]
Lovasz extension
Claim: g(w) = max_{x ∈ P_F} w^T x
where P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V}
[Figure: for a given direction w, x_w = argmax_{x ∈ P_F} w^T x and g(w) = w^T x_w]

Evaluating g(w) this way requires solving a linear program with
exponentially many constraints!
Evaluating the Lovasz extension

g(w) = max_{x ∈ P_F} w^T x
P_F = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V}

Theorem [Edmonds ’71, Lovasz ‘83]:
For any given w, can get an optimal solution x_w to the LP
using the following greedy algorithm: [M]
    Order V = {e1,…,en} so that w(e1) ≥ … ≥ w(en)
    Let x_w(e_i) = F({e1,…,e_i}) – F({e1,…,e_{i-1}})
Then w^T x_w = g(w) = max_{x ∈ P_F} w^T x

Sanity check: If w = w^A and A = {e1,…,ek}, then
    w^A·x_w = ∑_{i=1}^k [F({e1,…,e_i}) – F({e1,…,e_{i-1}})] = F(A)
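Edmonds’ greedy rule makes g(w) cheap to evaluate: sort by weight, then accumulate telescoping marginal gains. A sketch of our own (F takes a frozenset and F(∅) = 0):

```python
def lovasz_extension(F, V, w):
    """Evaluate g(w) = max_{x in P_F} w^T x via Edmonds' greedy algorithm."""
    order = sorted(V, key=lambda e: -w[e])   # w(e1) >= ... >= w(en)
    g, x, prefix, f_prev = 0.0, {}, set(), 0.0
    for e in order:
        prefix.add(e)
        f_cur = F(frozenset(prefix))
        x[e] = f_cur - f_prev  # x_w(e_i) = F({e1..ei}) - F({e1..e_{i-1}})
        g += w[e] * x[e]
        f_prev = f_cur
    return g, x

# The running example: F(∅)=0, F({a})=-1, F({b})=2, F({a,b})=0
table = {frozenset(): 0, frozenset("a"): -1, frozenset("b"): 2, frozenset("ab"): 0}
print(lovasz_extension(lambda A: table[A], "ab", {"a": 0, "b": 1}))
# (2.0, {'b': 2, 'a': -2}) — matching g([0,1]) = 2 = F({b}) on the next slide
```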
Example: Lovasz extension
g(w) = max {w^T x : x ∈ P_F}, with F as before:
    A        F(A)
    ∅        0
    {a}      -1
    {b}      2
    {a,b}    0
Take w = [0,1], i.e., w(a) = 0, w(b) = 1; we want g(w).
Greedy ordering: e1 = b, e2 = a, since w(e1) = 1 > w(e2) = 0
    x_w(e1) = F({b}) – F(∅) = 2
    x_w(e2) = F({b,a}) – F({b}) = -2
so x_w = [-2,2] and g([0,1]) = [0,1]^T [-2,2] = 2 = F({b})
Similarly, with w = [1,1]: x_w = [-1,1] and g([1,1]) = [1,1]^T [-1,1] = 0 = F({a,b})
Why is this useful?

Theorem [Lovasz ’83]:
g(w) attains its minimum in [0,1]^n at a corner!

If we can minimize g on [0,1]^n, we can minimize F…
    (at corners, g and F take the same values)
F(A) submodular ⇒ g(w) convex (and efficient to evaluate)

Does the converse also hold, i.e., if some convex g agrees with F at the corners, must F be submodular?
No: consider g(w1,w2,w3) = max(w1, w2+w3) on V = {a,b,c}. g is convex, but
F({a,b}) – F({a}) = 0 < F({a,b,c}) – F({a,c}) = 1, violating diminishing returns.
Tutorial Overview
Examples and properties of submodular functions
  Many problems submodular (mutual information, influence, …)
  SFs closed under positive linear combinations;
  not under min, max
Submodularity and convexity
  Every SF induces a convex function with SAME minimum
  Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions

Maximizing submodular functions

Extensions and research directions
Minimization of submodular functions
Overview minimization
Minimizing general submodular functions



Minimizing symmetric submodular functions



Applications to Machine Learning




Minimizing a submodular function
Want to solve      A* = argmin_A F(A)
Need to solve
    min_w max_x w^T x    s.t. w ∈ [0,1]^n, x ∈ P_F
(the inner maximum is g(w))

Equivalently:
    min_{c,w} c
    s.t. c ≥ w^T x for all x ∈ P_F
         w ∈ [0,1]^n
Ellipsoid algorithm
[Grötschel, Lovasz, Schrijver ’81]
[Figure: ellipsoid method shrinking the feasible region along the optimality direction]
    min_{c,w} c
    s.t. c ≥ w^T x for all x ∈ P_F
         w ∈ [0,1]^n
Separation oracle: find the most violated constraint:
    max_x w^T x – c s.t. x ∈ P_F
Can solve separation using the greedy algorithm!!
→ Ellipsoid algorithm minimizes SFs in poly-time!
Minimizing submodular functions
Ellipsoid algorithm not very practical
Want a combinatorial algorithm for minimization!

Theorem [Iwata (2001)]
There is a fully combinatorial, strongly polynomial
algorithm for minimizing SFs, that runs in time
    O(n^8 log^2 n)

Polynomial-time = practical???
A more practical alternative?
[Fujishige ’91, Fujishige et al ‘06]
Example, with F as before:
    A        F(A)
    ∅        0
    {a}      -1
    {b}      2
    {a,b}    0
Base polytope: B_F = P_F ∩ {x : x(V) = F(V)}
[Figure: B_F is the face of P_F where x({a,b}) = F({a,b}); its minimum norm point is x* = [-1,1]]

Minimum norm algorithm: [M]
    1. Find x* = argmin ||x||_2 s.t. x ∈ B_F      (here: x* = [-1,1])
    2. Return A* = {i : x*(i) < 0}                (here: A* = {a})

Theorem [Fujishige ’91]: A* is an optimal solution!
Note: can solve step 1 using Wolfe’s algorithm
Runtime finite but unknown!!
Empirical comparison
[Fujishige et al ’06]
[Figure: running time in seconds (log scale, lower is better) vs. problem size (log scale, 64–1024), on cut functions from the DIMACS Challenge]
Minimum norm algorithm is orders of magnitude faster!
Our implementation can solve n = 10k in < 6 minutes!
Checking optimality (duality)
Example, with F and B_F = P_F ∩ {x : x(V) = F(V)} as before.

Theorem [Edmonds ’70]
min_A F(A) = max_x {x^–(V) : x ∈ B_F}
where x^–(s) = min {x(s), 0}

Testing how close A’ is to min_A F(A):
  Run the greedy algorithm for w = w^{A’} to get x_w
  Then F(A’) ≥ min_A F(A) ≥ x_w^–(V)
E.g., for A = {a}, F(A) = -1: w = [1,0], x_w = [-1,1], x_w^– = [-1,0],
so x_w^–(V) = -1 → A is optimal!
Overview minimization
Minimizing general submodular functions
  Can minimize in polytime using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions

Applications to Machine Learning
What if we have special structure?
Worst-case complexity of the best known algorithm: O(n^8 log^2 n)

Can we do better for special cases?

Example (again): Given RVs X1,…,Xn,
F(A) = I(XA; X_{V\A}) = I(X_{V\A}; XA) = F(V\A)

Functions F with F(A) = F(V\A) for all A are called symmetric
Another example: Cut functions
[Figure: weighted graph on V = {a,b,c,d,e,f,g,h} with edge weights 1–3]
    F(A) = ∑ {w_{s,t} : s ∈ A, t ∈ V\A}
Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2

Cut functions are symmetric and submodular!
Minimizing symmetric functions
For any A, submodularity implies
    2 F(A) = F(A) + F(V\A)
           ≥ F(A ∩ (V\A)) + F(A ∪ (V\A))
           = F(∅) + F(V) = 2 F(∅) = 0
Hence, any symmetric SF attains its minimum at ∅
In practice, we want a nontrivial partition of V into
A and V\A, i.e., require that A is neither ∅ nor V

Want A* = argmin F(A) s.t. 0 < |A| < n

There is an efficient algorithm for doing that!
Queyranne’s algorithm (overview)
[Queyranne ’98]
Theorem: There is a fully combinatorial, strongly polynomial algorithm for solving [M]

    A* = argmin_A F(A) s.t. 0 < |A| < n

for symmetric submodular functions F

Runs in time O(n^3) [instead of O(n^8)…]

Note: also works for “posimodular” functions:
    F posimodular ⇔ for all A, B ⊆ V: F(A) + F(B) ≥ F(A\B) + F(B\A)
Gomory-Hu trees
[Figure: weighted graph G on {a,…,h} and a corresponding tree T]
A tree T is called a Gomory-Hu (GH) tree for SF F
if for any s, t ∈ V it holds that
    min {F(A) : s ∈ A and t ∉ A} =
    min {w_{i,j} : (i,j) is an edge on the s-t path in T}
“min s-t cut in T = min s-t cut in G”

Theorem [Queyranne ‘93]: GH-trees exist for any symmetric SF F!
But they are expensive to find in general!
Pendent pairs
For a function F on V and s, t ∈ V: (s,t) is a pendent pair if
    {s} ∈ argmin_A F(A) s.t. s ∈ A, t ∉ A
Pendent pairs always exist:
[Figure: Gomory-Hu tree T on {a,…,h} with edge weights]
Take any leaf s and its neighbor t; then (s,t) is pendent!
E.g., (a,c), (b,c), (f,e), …

Theorem [Queyranne ’95]: Can find pendent pairs in O(n^2)
(without needing the GH-tree!)
Why are pendent pairs useful?
Key idea: Let (s,t) be pendent and A* = argmin F(A). Then EITHER
  s and t are separated by A*, e.g., s ∈ A*, t ∉ A*.
  But then A* = {s}!! OR
  s and t are not separated by A* (both inside, or both outside).
In that case we can merge s and t…
Merging
Suppose F is a symmetric SF on V,
and we want to merge the pendent pair (s,t)
Key idea: “If we pick s, we get t for free”
    V’ = V \ {t}
    F’(A) = F(A ∪ {t})  if s ∈ A
          = F(A)        if s ∉ A
Lemma: F’ is still symmetric and submodular!
[Figure: merging (a,c) in the example graph contracts the two nodes and adds up the parallel edge weights]
Queyranne’s algorithm
Input:  symmetric SF F on V, |V| = n
Output: A* = argmin F(A) s.t. 0 < |A| < n

Initialize F’ ← F, and V’ ← V
For i = 1 to n-1
    (s,t) ← pendentPair(F’,V’)
    A_i = {s}
    (F’,V’) ← merge(F’,V’,s,t)
Return argmin_i F(A_i)

Running time: O(n^3) function evaluations
Note: Finding pendent pairs

Initialize v1 ← x (x is an arbitrary element of V)
For i = 1 to n-1 do
    W_i ← {v1,…,vi}
    v_{i+1} ← argmin_v F(W_i ∪ {v}) - F({v}) s.t. v ∈ V\W_i
Return pendent pair (v_{n-1}, v_n)

Requires O(n^2) evaluations of F
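Putting the two routines together gives a compact, if unoptimized, implementation. This is our own Python sketch of the pseudocode above (F takes a frozenset; elements are tracked as merged “groups” so that merging is a set union):

```python
def queyranne(F, V):
    """Minimize symmetric submodular F over A with 0 < |A| < n.

    Repeatedly find a pendent pair, record the candidate {s}, merge s and t;
    return the best recorded candidate. O(n^3) evaluations of F.
    """
    groups = [frozenset([v]) for v in V]

    def Fg(gs):  # F' on the merged ground set = F on the union of the groups
        return F(frozenset().union(*gs))

    candidates = []
    while len(groups) > 1:
        # pendentPair: grow W greedily, minimizing F(W ∪ {v}) - F({v})
        W, rest = [groups[0]], groups[1:]
        while rest:
            v = min(rest, key=lambda g: Fg(W + [g]) - Fg([g]))
            W.append(v)
            rest.remove(v)
        t, s = W[-2], W[-1]   # pendent pair (v_{n-1}, v_n); {s} is a candidate
        candidates.append(s)
        groups = [g for g in groups if g != s and g != t] + [s | t]

    return min(candidates, key=F)

# Example: minimum nontrivial cut of a small weighted graph
edges = {("a", "b"): 2, ("b", "c"): 1, ("c", "d"): 3, ("a", "c"): 1}
cut = lambda A: sum(w for (u, v), w in edges.items() if (u in A) != (v in A))
print(queyranne(cut, "abcd"))  # frozenset({'a','b'}) or its complement; cut = 2
```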
Overview minimization
Minimizing general submodular functions
  Can minimize in polytime using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
  Many useful submodular functions are symmetric
  Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
Application: Clustering
[Narasimhan, Jojic, Bilmes NIPS ’05]
Group data points V into “homogeneous clusters”
Find a partition V = A1 ∪ … ∪ Ak that minimizes

    F(A1,…,Ak) = ∑_i E(A_i)

where E(A_i) is the “inhomogeneity of A_i”. Examples for E(A):
  Entropy H(A)
  Cut function
Special case k = 2: then F(A) = E(A) + E(V\A) is symmetric!
If E is submodular, we can use Queyranne’s algorithm!
What if we want k > 2 clusters?
[Zhao et al ’05, Narasimhan et al ‘05]

Greedy Splitting algorithm: [M] (a sketch appears below)
Start with partition P = {V}
For i = 1 to k-1
  For each member C_j ∈ P do
    split cluster C_j:
        A* = argmin E(A) + E(C_j\A) s.t. 0 < |A| < |C_j|
        P_j ← P \ {C_j} ∪ {A*, C_j\A*}
    (P_j is the partition we get by splitting the j-th cluster)
  P ← argmin_j F(P_j)

Theorem: F(P) ≤ (2 - 2/k) F(P_opt)
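A compact sketch of greedy splitting (our own code, reusing the `queyranne` routine from above to solve each symmetric split; assumes k ≤ |V| and E(∅) = 0):

```python
def greedy_splitting(E, V, k):
    """Partition V into k clusters, greedily minimizing F(P) = sum_i E(A_i).

    E: submodular "inhomogeneity" on frozensets. Each candidate split of a
    cluster C minimizes the symmetric function A -> E(A) + E(C \ A) with
    Queyranne's algorithm. Guarantee: F(P) <= (2 - 2/k) F(P_opt).
    """
    P = [frozenset(V)]
    for _ in range(k - 1):
        best = None
        for j, C in enumerate(P):
            if len(C) < 2:
                continue  # singleton clusters cannot be split
            A = queyranne(lambda S: E(S) + E(C - S), list(C))
            Pj = P[:j] + P[j + 1:] + [A, C - A]
            cost = sum(E(B) for B in Pj)
            if best is None or cost < best[0]:
                best = (cost, Pj)
        P = best[1]
    return P
```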
Example: Clustering species
[Narasimhan et al ‘05]
[Figure: genome sequences of several species]
Common genetic information = # of common substrings

Can easily extend to sets of species
Example: Clustering species
    [Narasimhan et al ‘05]
The common genetic information ICG
  does not require alignment
  captures genetic similarity
  is smallest for maximally evolutionarily diverged species
  is a symmetric submodular function! 


Greedy splitting algorithm yields phylogenetic tree!




Example: SNPs
   [Narasimhan et al ‘05]
Study human genetic variation
(for personalized medicine, …)
Most human variation due to point mutations that occur
once in human history at that base location:
   Single Nucleotide Polymorphisms (SNPs)
Cataloging all variation too expensive
($10K-$100K per individual!!)




SNPs in the ACE gene
[Narasimhan et al ‘05]
[Figure: binary matrix; rows: individuals, columns: SNPs]
Which columns should we pick to reconstruct the rest?
Can find a near-optimal clustering (Queyranne’s algorithm)
Reconstruction accuracy
[Narasimhan et al ‘05]
[Figure: reconstruction accuracy vs. # of clusters]
Comparison with clustering based on
  Entropy
  Prediction accuracy
  Pairwise correlation
  PCA
Example: Speaker segmentation
[Reyes-Gomez, Jojic ‘07]
Given mixed waveforms of two speakers (e.g., Alice and Fiona each saying an unknown word such as “308” or “217”), partition the spectrogram (the time-frequency plane) into one region per speaker.
E(A) = -log p(XA): likelihood of “region” A
F(A) = E(A) + E(V\A) is symmetric & posimodular
→ partition the spectrogram using Queyranne’s algorithm
[Figure: spectrogram (frequency vs. time) with region A assigned to “Fiona”]
Example: Image denoising
[Figure: noisy input image and denoised result]
Example: Image denoising
Pairwise Markov Random Field:
[Figure: grid MRF; X_i: noisy pixels, Y_i: “true” pixels]
P(x1,…,xn, y1,…,yn) = ∏_{i,j} ψ_{i,j}(y_i,y_j) ∏_i φ_i(x_i,y_i)

Want argmax_y P(y | x)
   = argmax_y log P(x,y)
   = argmin_y ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i)
where E_{i,j}(y_i,y_j) = -log ψ_{i,j}(y_i,y_j)

When is this MAP inference efficiently solvable
(in high-treewidth graphical models)?
MAP inference in Markov Random Fields
[Kolmogorov et al, PAMI ’04; see also: Hammer, Ops Res ‘65]

Energy E(y) = ∑_{i,j} E_{i,j}(y_i,y_j) + ∑_i E_i(y_i)

Suppose the y_i are binary; define
    F(A) = E(y^A) where y^A_i = 1 iff i ∈ A
Then min_y E(y) = min_A F(A)

Theorem
MAP inference problem solvable by graph cuts
  ⇔ For all i,j: E_{i,j}(0,0) + E_{i,j}(1,1) ≤ E_{i,j}(0,1) + E_{i,j}(1,0)
  ⇔ each E_{i,j} is submodular
“Efficient if we prefer that neighboring pixels have the same color”
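The per-edge condition is a one-liner to test (a small helper of our own, not from the paper):

```python
def pairwise_submodular(E_ij):
    """Graph-cut solvability test for one binary pairwise energy term.

    E_ij is a 2x2 table E_ij[yi][yj]; submodular iff
    E(0,0) + E(1,1) <= E(0,1) + E(1,0).
    """
    return E_ij[0][0] + E_ij[1][1] <= E_ij[0][1] + E_ij[1][0]

# A Potts-style smoothing term (penalty 1 when neighbors disagree) qualifies:
print(pairwise_submodular([[0, 1], [1, 0]]))  # True
```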
Constrained minimization
Have seen: if F is submodular on V, we can solve
    A* = argmin F(A)    s.t. A ⊆ V

What about
    A* = argmin F(A)    s.t. A ⊆ V and |A| ≤ k ?
E.g., clustering with a minimum # of points per cluster, …

In general, not much is known about constrained minimization. However, we can do:
  A* = argmin F(A) s.t. 0 < |A| < n
  A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan ‘95]
  A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige ’91]
Overview minimization
Minimizing general submodular functions
  Can minimize in polytime using the ellipsoid method
  Combinatorial, strongly polynomial algorithm: O(n^8)
  Practical alternative: minimum norm algorithm?
Minimizing symmetric submodular functions
  Many useful submodular functions are symmetric
  Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
Applications to Machine Learning
  Clustering [Narasimhan et al ’05]
  Speaker segmentation [Reyes-Gomez & Jojic ’07]
  MAP inference [Kolmogorov et al ’04]
Tutorial Overview
Examples and properties of submodular functions
   Many problems submodular (mutual information, influence, …)
   SFs closed under positive linear combinations; not under min, max
Submodularity and convexity
   Every SF induces a convex function with SAME minimum
   Special properties: Greedy solves LP over exponential polytope
Minimizing submodular functions
   Minimization possible in polynomial time (but O(n^8)…)
   Queyranne’s algorithm minimizes symmetric SFs in O(n^3)
   Useful for clustering, MAP inference, structure learning, …
Maximizing submodular functions

Extensions and research directions
Maximizing submodular functions
Maximizing submodular functions

Minimizing convex functions: polynomial-time solvable!
Minimizing submodular functions: polynomial-time solvable!

Maximizing convex functions: NP-hard!
Maximizing submodular functions: NP-hard!
  But we can get approximation guarantees
Maximizing influence
[Kempe, Kleinberg, Tardos KDD ’03]
[Figure: social network with influence probabilities, as before]

F(A) = expected # of people influenced when targeting A
F monotonic: if A ⊆ B then F(A) ≤ F(B). Hence V = argmax_A F(A)
More interesting: argmax_A F(A) – Cost(A)
Maximizing non-monotonic functions
Suppose we want, for a non-monotonic F,
    A* = argmax F(A) s.t. A ⊆ V
[Figure: F(A) vs. |A| with an interior maximum]
Example:
  F(A) = U(A) – C(A) where U(A) is a submodular utility,
  and C(A) is a supermodular cost function
  E.g.: trading off utility and privacy in personalized search
  [Krause & Horvitz AAAI ’08]

In general: NP-hard. Moreover:
If F(A) can take negative values, it is as hard to approximate as
maximum independent set (hard to approximate within a factor of n^{1-ε})
Maximizing positive submodular functions
[Feige, Mirrokni, Vondrak FOCS ’07]

Theorem
There is an efficient randomized local-search procedure that,
given a positive submodular function F with F(∅) = 0,
returns a set A_LS such that
    F(A_LS) ≥ (2/5) max_A F(A)

Picking a random set gives a 1/4 approximation
(1/2 approximation if F is symmetric!)
We cannot get better than a 3/4 approximation unless P = NP
Scalarization vs. constrained maximization
Given monotonic utility F(A) and cost C(A), optimize either:
  Option 1 (“scalarization”):              max_A F(A) – C(A)  s.t. A ⊆ V
  Option 2 (“constrained maximization”):   max_A F(A)  s.t. C(A) ≤ B

Option 1: can get a 2/5 approximation… but only if F(A) – C(A) ≥ 0
for all A ⊆ V, and positiveness is a strong requirement.
Option 2: coming up…
Constrained maximization: Outline
    max_A F(A)  s.t.  C(A) ≤ B
(F monotonic submodular, A the selected set, C the selection cost, B the budget)

  Subset selection: C(A) = |A|
  Robust optimization
  Complex constraints
Monotonicity
A set function is called monotonic if
    A ⊆ B ⊆ V ⇒ F(A) ≤ F(B)

Examples:
  Influence in social networks [Kempe et al KDD ’03]
  For discrete RVs, entropy F(A) = H(XA) is monotonic:
    suppose B = A ∪ C; then
    F(B) = H(XA, XC) = H(XA) + H(XC | XA) ≥ H(XA) = F(A)
  Information gain: F(A) = H(Y) - H(Y | XA)
  Set cover
  Matroid rank functions (dimension of vector spaces, …)
  …
Subset selection
Given:    finite set V, monotonic submodular function F, F(∅) = 0
Want:     A* ⊆ V such that
    A* = argmax F(A) s.t. |A| ≤ k
NP-hard!
Exact maximization of monotonic submodular functions
1) Mixed integer programming [Nemhauser et al ’81]

    max η
    s.t. η ≤ F(B) + ∑_{s ∈ V\B} α_s δ_s(B)  for all B ⊆ V
         ∑_s α_s ≤ k
         α_s ∈ {0,1}
    where δ_s(B) = F(B ∪ {s}) – F(B)

  Solved using constraint generation
2) Branch-and-bound: “data-correcting algorithm” [Goldengorin et al ’99] [M]

Both algorithms are worst-case exponential!
Approximate maximization
Given: finite set V, monotonic submodular function F(A)
Want:  A* ⊆ V such that
    A* = argmax F(A) s.t. |A| ≤ k    (NP-hard!)

Greedy algorithm: [M]
  Start with A_0 = ∅
  For i = 1 to k
    s_i := argmax_s F(A_{i-1} ∪ {s}) - F(A_{i-1})
    A_i := A_{i-1} ∪ {s_i}
Performance of greedy algorithm

Theorem [Nemhauser et al ‘78]
Given a monotonic submodular function F with F(∅) = 0, the
greedy maximization algorithm returns A_greedy with
    F(A_greedy) ≥ (1 - 1/e) max_{|A| ≤ k} F(A)    (1 - 1/e ≈ 63%)

Sidenote: the greedy algorithm gives a 1/2 approximation for
maximization over any matroid C! [Fisher et al ’78]
An “elementary” counterexample
X1, X2 ~ Bernoulli(0.5); Y = X1 XOR X2

Let F(A) = IG(XA; Y) = H(Y) – H(Y|XA)

Y | X1 and Y | X2 ~ Bernoulli(0.5) (entropy 1)
Y | X1, X2 is deterministic! (entropy 0)

Hence F({1,2}) – F({1}) = 1, but F({2}) – F(∅) = 0
→ F(A) is only submodular under some conditions! (later)
Example: Submodularity of info-gain
Y1,…,Ym, X1, …, Xn discrete RVs
F(A) = IG(Y; XA) = H(Y) - H(Y | XA)
F(A) is always monotonic
However, NOT always submodular

Theorem [Krause & Guestrin UAI’ 05]
If the X_i are all conditionally independent given Y,
then F(A) is submodular!
[Figure: graphical model with Y1, Y2, Y3 as parents of X1,…,X4]

Hence, the greedy algorithm works!
In fact, NO algorithm can do better than a (1-1/e) approximation!
Building a Sensing Chair
[Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST ‘07]
People sit a lot
Activity recognition in assistive technologies
Seating pressure as user interface
[Figure: chair equipped with 1 sensor per cm^2; costs $16,000! Postures: lean left, lean forward, slouch, …]
82% accuracy on 10 postures! [Tan et al]
Can we get similar accuracy with fewer, cheaper sensors?
How to place sensors on a chair?
Model sensor readings at locations V as random variables
Predict posture Y using a probabilistic model P(Y,V)
Pick sensor locations A* ⊆ V to minimize entropy:
    A* = argmin_A H(Y | XA)  s.t. |A| ≤ k

Placed sensors, did a user study:
              Accuracy     Cost
    Before    82%          $16,000
    After     79%          $100

Similar accuracy at <1% of the cost!
Variance reduction
(a.k.a. orthogonal matching pursuit, forward regression)
Let Y = ∑_i α_i X_i + ε, with (X1,…,Xn, ε) ∼ N(·; μ, Σ)
Want to pick a subset XA to predict Y
Var(Y | XA = xA): conditional variance of Y given XA = xA
Expected variance: Var(Y | XA) = ∫ p(xA) Var(Y | XA = xA) dxA
Variance reduction: F_V(A) = Var(Y) – Var(Y | XA)

F_V(A) is always monotonic

Theorem [Das & Kempe, STOC ’08]: F_V(A) is submodular*
(*under some conditions on Σ)
→ Orthogonal matching pursuit near optimal!
[see other analyses by Tropp, Donoho et al., and Temlyakov]
Batch mode active learning [Hoi et al, ICML’06]
Which data points should we label to minimize error?
[Figure: partially labeled data set with + and – labels and many unlabeled points o]
Want a batch A of k points to show an expert for labeling.

F(A) selects examples that are
  uncertain [σ²(s) = π(s)(1 - π(s)) is large]
  diverse (points in A are as different as possible)
  relevant (as close to V\A as possible, s^T s’ large)
F(A) is submodular and monotonic!
[approximation to the improvement in Fisher information]
Results about Active Learning
[Hoi et al, ICML’06]
[Figure: classification performance vs. number of labeled examples]
Batch mode active learning performs better than
  picking k points at random
  picking k points of highest entropy
Monitoring water networks
[Krause et al, J Wat Res Mgt 2008]
Contamination of drinking water could affect millions of people
[Figure: water distribution network (simulator from EPA) with a contamination event and sensors; a Hach sensor costs ~$14K]
Place sensors to detect contaminations
“Battle of the Water Sensor Networks” competition
Where should we place sensors to quickly detect contamination?
Model-based sensing
Utility of placing sensors is based on a model of the world
  For water networks: water flow simulator from EPA
F(A) = expected impact reduction when placing sensors at A
[Figure: set V of all network junctions; the model predicts high-, medium-, and low-impact contamination locations; a sensor reduces impact through early detection. E.g., high impact reduction: F(A) = 0.9; low impact reduction: F(A) = 0.01]

Theorem [Krause et al., J Wat Res Mgt ’08]:
Impact reduction F(A) in water networks is submodular!
Battle of the Water Sensor Networks Competition
Real metropolitan area network (12,527 nodes)
Water flow simulator provided by EPA
3.6 million contamination events
Multiple objectives:
  Detection time, affected population, …
Place sensors that detect well “on average”




Bounds on optimal solution
[Krause et al., J Wat Res Mgt ’08]
[Figure: population protected F(A) (higher is better) vs. number of sensors placed (0–20) on the water networks data; the greedy solution sits well below the offline (Nemhauser) bound F(A_greedy)/(1 - 1/e)]
The (1-1/e) bound is quite loose… can we get better bounds?
Data-dependent bounds
[Minoux ’78]
Suppose A is a candidate solution to
    argmax F(A) s.t. |A| ≤ k
and let A* = {s1,…,sk} be an optimal solution. Then

    F(A*) ≤ F(A ∪ A*)
          = F(A) + ∑_i [F(A ∪ {s1,…,si}) - F(A ∪ {s1,…,s_{i-1}})]
          ≤ F(A) + ∑_i [F(A ∪ {si}) - F(A)]
          = F(A) + ∑_i δ_{si}

For each s ∈ V\A, let δ_s = F(A ∪ {s}) - F(A), and order so that
δ_1 ≥ δ_2 ≥ … ≥ δ_n. Then F(A*) ≤ F(A) + ∑_{i=1}^k δ_i.  [M]
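As a small sketch (our own helper), computing this bound is one pass over the remaining elements:

```python
def data_dependent_bound(F, V, A, k):
    """Minoux's online upper bound on max_{|S| <= k} F(S), given candidate A.

    F(A*) <= F(A) + sum of the k largest marginals F(A ∪ {s}) - F(A).
    """
    A = frozenset(A)
    fA = F(A)
    deltas = sorted((F(A | {s}) - fA for s in V if s not in A), reverse=True)
    return fA + sum(deltas[:k])
```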
Bounds on optimal solution
[Krause et al., J Wat Res Mgt ’08]
[Figure: sensing quality F(A) (higher is better) vs. number of sensors placed on the water networks data; the data-dependent bound lies much closer to the greedy solution than the offline (Nemhauser) bound]
Submodularity gives data-dependent bounds on the
performance of any algorithm
BWSN Competition results
[Ostfeld et al., J Wat Res Mgt 2008]
13 participants
Performance measured in 30 different criteria
  G: Genetic algorithm    H: Other heuristic
  D: Domain knowledge     E: “Exact” method (MIP)
[Figure: total score (higher is better, 0–30) per entry, each marked G, H, D, or E]
24% better performance than runner-up!
What was the trick?
Simulated all 3.6M contaminations on 2 weeks / 40 processors
152 GB data on disk, 16 GB in main memory (compressed)
Very accurate computation of F(A), but very slow evaluation of F(A):
30 hours/20 sensors, 6 weeks for all 30 settings!

[Plot: running time (minutes, lower is better) vs. number of sensors
 selected (1-10), comparing exhaustive search (all subsets) and naive greedy.]

Submodularity to the rescue
                                                                           95
Scaling up greedy algorithm
    [Minoux ’78]
In round i+1,
  have picked Ai = {s1,…,si}
  pick si+1 = argmaxs F(Ai ∪ {s}) - F(Ai)
I.e., maximize “marginal benefit” δs(Ai)

    δs(Ai) = F(Ai ∪ {s}) - F(Ai)

Key observation: Submodularity implies

    i ≤ j  ⇒  δs(Ai) ≥ δs(Aj)

Marginal benefits can never increase!
                                                                           96
“Lazy” greedy algorithm
        [Minoux ’78]

  Lazy greedy algorithm:                M
      First iteration as usual
      Keep an ordered list of marginal benefits δi from previous iteration
      Re-evaluate δi only for the top element
      If δi stays on top, use it; otherwise re-sort

[Figure: priority list of candidate elements a-e ordered by benefit δs(A);
 only the top element needs re-evaluation.]

Note: Very easy to compute online bounds, lazy evaluations, etc.
      [Leskovec et al. ’07]
                                                                           97
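To make the bookkeeping concrete, here is a minimal Python sketch of the lazy
greedy rule using a max-heap of stale marginal benefits. The callable F, the
ground set V and the budget k are assumed to be supplied by the caller; this
is an illustrative sketch, not the toolbox implementation.

    import heapq

    def lazy_greedy(F, V, k):
        # Max-heap of stale marginal benefits (negated: heapq is a min-heap).
        # Assumes F is monotone submodular and k <= len(V).
        A, fA = set(), F(set())
        heap = [(-(F({s}) - fA), s) for s in V]
        heapq.heapify(heap)
        for _ in range(k):
            while True:
                neg_delta, s = heapq.heappop(heap)
                delta = F(A | {s}) - fA  # re-evaluate only the top element
                if not heap or delta >= -heap[0][0]:
                    # Still on top: submodularity guarantees s is the best pick.
                    A.add(s)
                    fA += delta
                    break
                heapq.heappush(heap, (-delta, s))  # stale; re-sort and retry
        return A

Because stale benefits only over-estimate the true ones (marginal benefits can
never increase), selecting an element that survives re-evaluation returns
exactly the same set as plain greedy.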
Result of lazy evaluation
Simulated all 3.6M contaminations on 2 weeks / 40 processors
152 GB data on disk, 16 GB in main memory (compressed)
Very accurate computation of F(A), but very slow evaluation of F(A):
30 hours/20 sensors, 6 weeks for all 30 settings!

[Plot: running time (minutes, lower is better) vs. number of sensors
 selected (1-10); fast (lazy) greedy stays near zero while exhaustive search
 (all subsets) and naive greedy blow up.]

Submodularity to the rescue:
Using “lazy evaluations”: 1 hour/20 sensors, done after 2 days!
                                                                           98
What about worst-case?
      [Krause et al., NIPS ’07]
Knowing the sensor locations, an adversary contaminates here!

[Figure: two placements of sensors S1-S4. Each placement detects well on
 “average-case” (accidental) contamination; the two placements have very
 different average-case impact, but the same worst-case impact.]

Where should we place sensors to quickly detect in the worst case?
                                                                           99
Constrained maximization: Outline

    A* = argmaxA F(A)  s.t.  C(A) ≤ B
    (utility function F, selected set A, selection cost C, budget B)

              Subset selection
    Robust optimization        Complex constraints
                                                                          100
Optimizing for the worst case
  Separate utility function Fi for each contamination i
  Fi(A) = impact reduction by sensors A for contamination i
  Want to solve

      A* = argmaxA mini Fi(A)

[Figure: sensors B detect contamination at node s well (Fs(B) is high), while
 sensors A detect contamination at node r well (Fr(A) is high, Fr(B) low).]

Each of the Fi is submodular
Unfortunately, mini Fi not submodular!
                                                                          101
How can we solve this robust optimization problem?
How does the greedy algorithm do?
V = {a, b, c}, can only buy k=2

    Set A    F1    F2    mini Fi
    {a}      1     0     0
    {b}      0     2     0
    {c}      ε     ε     ε
    {a,c}    1     ε     ε
    {b,c}    ε     2     ε
    {a,b}    1     2     1

Greedy picks c first (the only singleton with mini Fi > 0).
Then it can only reach {a,c} or {b,c}.
Optimal solution: {a,b} with score 1. Greedy score: ε.
Greedy does arbitrarily badly. Is there something better?

Theorem [NIPS ’07]: The problem max|A|≤k mini Fi(A)
does not admit any approximation unless P=NP

Hence we can’t find any approximation algorithm. Or can we?
                                                                          102
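As a sanity check, this toy instance can be replayed in a few lines of Python.
The element names and the max-of-weights form of F1, F2 below are illustrative
choices consistent with the table, not part of the original example.

    eps = 0.01
    w1 = {'a': 1.0, 'b': 0.0, 'c': eps}
    w2 = {'a': 0.0, 'b': 2.0, 'c': eps}

    # Fi(A) = largest weight in A (a simple submodular function).
    F1 = lambda A: max((w1[s] for s in A), default=0.0)
    F2 = lambda A: max((w2[s] for s in A), default=0.0)
    score = lambda A: min(F1(A), F2(A))

    A = set()
    for _ in range(2):  # budget k = 2
        A |= {max({'a', 'b', 'c'} - A, key=lambda s: score(A | {s}))}
    print(A, score(A))        # greedy ends with worst-case score eps
    print(score({'a', 'b'}))  # the optimal set {a,b} achieves 1.0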
Alternative formulation
If somebody told us the optimal value

    c* = max|A|≤k mini Fi(A),

can we recover the optimal solution A*?

Need to find a set A with  mini Fi(A) ≥ c*  and  |A| ≤ k

Is this any easier?

Yes, if we relax the constraint |A| ≤ k
                                                                          103
Solving the alternative problem
  Trick: For each Fi and c, define truncation

      F'i,c(A) = min{Fi(A), c}

  Remains submodular!

[Figure: Fi(A) grows with |A|; F'i,c(A) is capped at level c.]

  Problem 1 (last slide):              Problem 2:
  find smallest A with                 find smallest A with
  mini Fi(A) ≥ c                       F'avg,c(A) = c, where
                                       F'avg,c(A) = (1/m) ∑i min{Fi(A), c}

  Non-submodular!                      Submodular!
  Don’t know how to solve              But appears as constraint?

        Same optimal solutions! Solving one solves the other
                                                                          104
Maximization vs. coverage
  Previously: Wanted
      A* = argmax F(A) s.t. |A| ≤ k

  Now need to solve:
      A* = argmin |A| s.t. F(A) ≥ Q
                                   M
  Greedy algorithm:
    Start with A := ∅;                    For bound, assume
    While F(A) < Q and |A| < n            F is integral.
       s* := argmaxs F(A ∪ {s})           If not, just round it.
       A := A ∪ {s*}

Theorem [Wolsey et al]: Greedy will return Agreedy with
    |Agreedy| ≤ (1 + log maxs F({s})) |Aopt|
                                                                          105
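A direct transcription of this coverage greedy, assuming F is given as a
Python callable on sets (an illustrative sketch; the same loop is reused by
the SATURATE sketch further below):

    def greedy_coverage(F, V, Q):
        # Add the element with the largest value until the quota Q is met.
        A = set()
        while F(A) < Q and len(A) < len(V):
            A.add(max(V - A, key=lambda s: F(A | {s})))
        return A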
Solving the alternative problem
  Trick: For each Fi and c, define truncation F'i,c(A) = min{Fi(A), c}

[Figure: Fi(A) grows with |A|; F'i,c(A) is capped at level c.]

  Problem 1 (last slide):              Problem 2:
  Non-submodular!                      Submodular!
  Don’t know how to solve              Can use greedy algorithm!
                                                                          106
Back to our example

Guess c=1
    Set A    F1    F2    mini Fi    F'avg,1
    {a}      1     0     0          ½
    {b}      0     2     0          ½
    {c}      ε     ε     ε          ε
    {a,c}    1     ε     ε          (1+ε)/2
    {b,c}    ε     2     ε          (1+ε)/2
    {a,b}    1     2     1          1

First pick a, then pick b: optimal solution!

How do we find c?
Do binary search!
                                                                          107
SATURATE Algorithm
         [Krause et al, NIPS ‘07]
Given: set V, integer k and monotonic SFs F1,…,Fm   M
Initialize cmin = 0, cmax = mini Fi(V)
Do binary search: c = (cmin + cmax)/2
   Greedily find AG such that F'avg,c(AG) = c
   If |AG| ≤ α k: increase cmin
   If |AG| > α k: decrease cmax
until convergence

[Figure: placements found for increasing values of the truncation threshold c.]
                                                                          108
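Putting the pieces together, a compact sketch of SATURATE, reusing
greedy_coverage from above. The stopping tolerance eps and the
list-of-callables interface are assumptions of this sketch:

    def saturate(Fs, V, k, alpha, eps=1e-3):
        # Fs: list of monotone submodular callables; alpha: relaxation of |A|.
        m = len(Fs)
        def F_avg(A, c):  # truncated average, submodular for every c
            return sum(min(F(A), c) for F in Fs) / m
        c_min, c_max = 0.0, min(F(set(V)) for F in Fs)
        A_best = set()
        while c_max - c_min > eps:
            c = (c_min + c_max) / 2
            A_G = greedy_coverage(lambda A: F_avg(A, c), set(V), Q=c)
            if len(A_G) <= alpha * k:
                c_min, A_best = c, A_G  # c achievable: search higher
            else:
                c_max = c               # not achievable within alpha*k picks
        return A_best

Note that F_avg(A, c) = c holds exactly when all Fi(A) ≥ c, so the inner
coverage problem certifies whether the guessed worst-case value is reachable.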
Theoretical guarantees
    [Krause et al, NIPS ‘07]

Theorem: The problem max|A|≤k mini Fi(A)
does not admit any approximation unless P=NP

Theorem: SATURATE finds a solution AS such that

     mini Fi(AS) ≥ OPTk and |AS| ≤ α k

where      OPTk = max|A|≤k mini Fi(A)
               α = 1 + log maxs ∑i Fi({s})

Theorem:
If there were a polytime algorithm with better factor
β < α, then NP ⊆ DTIME(n^(log log n))
                                                                          109
Example: Lake monitoring
    Monitor pH values using robotic sensor

[Figure: pH value along a transect: observations A, prediction at unobserved
 locations, true (hidden) pH values, and the predictive variance Var(s | A)
 as a function of position s along the transect.]

Use probabilistic model (Gaussian processes) to estimate prediction error

Where should we sense to minimize our maximum error?

    Robust submodular optimization problem!
    (often) submodular [Das & Kempe ’08]
                                                                          110
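For concreteness, a minimal sketch of how Var(s | A) could be computed with a
Gaussian process. The squared-exponential kernel and all hyperparameters below
are illustrative assumptions, not the values used in these experiments:

    import numpy as np

    def gp_variance(s, A, lengthscale=1.0, noise=0.1):
        # Squared-exponential kernel on 1-D positions along the transect.
        k = lambda x, y: np.exp(-0.5 * (x - y) ** 2 / lengthscale ** 2)
        if not A:
            return 1.0  # prior variance k(s, s)
        X = np.array(sorted(A), dtype=float)
        K = k(X[:, None], X[None, :]) + noise ** 2 * np.eye(len(X))
        ks = k(X, s)
        # Posterior variance: k(s,s) - ks^T (K + noise^2 I)^{-1} ks
        return float(1.0 - ks @ np.linalg.solve(K, ks))

Minimizing the maximum of Var(s | A) over locations s then instantiates the
robust objective with one Fi per location of interest.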
Comparison with state of the art
  Algorithm used in geostatistics: Simulated Annealing
      [Sacks & Schiller ’88, van Groeningen & Stein ’98, Wiens ’05,…]
  7 parameters that need to be fine-tuned

[Plots: maximum marginal variance (lower is better) vs. number of sensors,
 on environmental monitoring and precipitation data. SATURATE matches or
 beats greedy and simulated annealing on both datasets.]

  SATURATE is competitive & 10x faster
  No parameters to tune!
                                                                          111
Results on water networks

[Plot: maximum detection time (minutes, lower is better) vs. number of sensors
 on the water networks data. Greedy shows no decrease until all contaminations
 are detected; SATURATE lies far below both greedy and simulated annealing.]

    60% lower worst-case detection time!
                                                                          112
Worst- vs. average case
Given: Set V, submodular functions F1,…,Fm

    Average-case score:  Fac(A) = (1/m) ∑i Fi(A)    Too optimistic?
    Worst-case score:    Fwc(A) = mini Fi(A)        Very pessimistic!

Want to optimize both average- and worst-case score!

Can modify SATURATE to solve this problem!
  Want:      Fac(A) ≥ cac and Fwc(A) ≥ cwc
  Truncate:  min{Fac(A), cac} + min{Fwc(A), cwc} ≥ cac + cwc
  (since each truncated term is capped at its threshold, the sum reaches
   cac + cwc exactly when both constraints hold)
                                                                          113
Worst- vs. average case

[Plot: worst-case impact vs. average-case impact (lower is better on both
 axes) on the water networks data. Optimizing only for the average case or
 only for the worst case gives extreme points; SATURATE traces the tradeoff
 curve, which has a clear knee.]

Can find good compromise between average- and worst-case score!
                                                                          114
Constrained maximization: Outline

    A* = argmaxA F(A)  s.t.  C(A) ≤ B
    (utility function F, selected set A, selection cost C, budget B)

              Subset selection
    Robust optimization        Complex constraints
                                                                          115
Other aspects: Complex constraints
   maxA F(A) or maxA mini Fi(A) subject to
      So far: |A| ≤ k
      In practice, more complex constraints:
      Different costs: C(A) ≤ B
      Locations need to be connected by paths (lake monitoring)
          [Chekuri & Pal, FOCS ’05] [Singh et al, IJCAI ’07]
      Sensors need to communicate, forming a routing tree (building monitoring)
                                                                          116
Non-constant cost functions
For each s ∈ V, let c(s) > 0 be its cost
(e.g., feature acquisition costs, …)
Cost of a set C(A) = ∑s∈A c(s) (modular function!)
Want to solve

    A* = argmax F(A) s.t. C(A) ≤ B

Cost-benefit greedy algorithm:                M
  Start with A := ∅;
  While there is an s ∈ V∖A s.t. C(A ∪ {s}) ≤ B

     s* := argmaxs: C(A∪{s})≤B [F(A ∪ {s}) - F(A)] / c(s)
     A := A ∪ {s*}
                                                                          117
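A sketch of this cost-benefit rule in Python (the callable interface is again
an illustrative assumption):

    def cost_benefit_greedy(F, costs, V, B):
        # Repeatedly pick the affordable element with the best
        # benefit/cost ratio.
        A, spent = set(), 0.0
        while True:
            fits = [s for s in V - A if spent + costs[s] <= B]
            if not fits:
                return A
            s = max(fits, key=lambda s: (F(A | {s}) - F(A)) / costs[s])
            A.add(s)
            spent += costs[s]

As the next slide shows, this natural rule alone can fail badly; it becomes
useful when combined with the unit-cost variant (see slide 119).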
Performance of cost-benefit greedy
Want
    maxA F(A) s.t. C(A) ≤ 1

    Set A    F(A)    C(A)
    {a}      2ε      ε
    {b}      1       1

Cost-benefit greedy picks a (ratio 2ε/ε = 2 beats ratio 1/1 = 1).
Then cannot afford b!

Cost-benefit greedy performs arbitrarily badly!
                                                                          118
Cost-benefit optimization
     [Wolsey ’82, Sviridenko ’04, Leskovec et al ’07]
Theorem [Leskovec et al. KDD ‘07]
   ACB: cost-benefit greedy solution and
   AUC: unit-cost greedy solution (i.e., ignore costs)
Then
       max { F(ACB), F(AUC) } ≥ ½ (1 - 1/e) OPT

Can still compute online bounds and
speed up using lazy evaluations

Note: Can also get
   (1-1/e) approximation in time O(n^4)             [Sviridenko ’04]
   Slightly better than ½ (1-1/e) in O(n^2)         [Wolsey ‘82]
                                                                          119
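The guarantee suggests simply running both variants and keeping the better
set. A small wrapper around the cost-benefit sketch above; the feasibility
convention for the unit-cost variant (pick by raw marginal benefit while
still respecting the true budget) is an assumption of this sketch:

    def unit_cost_greedy(F, costs, V, B):
        # Pick by raw marginal benefit among elements that still fit the budget.
        A, spent = set(), 0.0
        while True:
            fits = [s for s in V - A if spent + costs[s] <= B]
            if not fits:
                return A
            s = max(fits, key=lambda s: F(A | {s}) - F(A))
            A.add(s)
            spent += costs[s]

    def cef_select(F, costs, V, B):
        # max{F(A_CB), F(A_UC)} >= 1/2 (1 - 1/e) OPT by the theorem above.
        return max([cost_benefit_greedy(F, costs, V, B),
                    unit_cost_greedy(F, costs, V, B)], key=F)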
Example: Cascades in the Blogosphere
        [Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance ‘07]

[Figure: an information cascade spreading through blogs over time; readers of
 unselected blogs learn about the story after us!]

Which blogs should we read to learn about big cascades early?
                                                                          120
Water vs. Web

    Placing sensors in water networks  vs.  selecting informative blogs

In both problems we are given
   Graph with nodes (junctions / blogs) and edges (pipes / links)
   Cascades spreading dynamically over the graph (contamination / citations)
Want to pick nodes to detect big cascades early
In both applications, utility functions are submodular
      [Generalizes Kempe et al, KDD ’03]
                                                                          121
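A hedged sketch of what such a utility might look like: given a sample of
simulated cascades (each recording when it reaches each node), F(A) averages
the detection benefit over cascades. Since each term is a max-of-weights
function of A, F is submodular. The data layout below is an assumption of this
sketch:

    def detection_utility(cascades, horizon):
        # cascades: list of dicts mapping node -> time the cascade reaches it.
        def F(A):
            benefit = 0.0
            for hit in cascades:
                times = [hit[s] for s in A if s in hit]
                if times:
                    # Earlier detection gives a larger impact reduction.
                    benefit += max(0.0, horizon - min(times))
            return benefit / len(cascades)
        return F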
Performance on Blog selection

[Plots: (left) fraction of cascades captured (higher is better) vs. number of
 blogs; greedy beats the in-links, all-outlinks, #posts and random heuristics.
 (right) running time in seconds (lower is better) vs. number of blogs
 selected; exhaustive search and naive greedy blow up while fast greedy stays
 flat. Blog selection, ~45k blogs.]

   Outperforms state-of-the-art heuristics
   700x speedup using submodularity!
                                                                          122
Cost of reading a blog

  Naïve approach: Just pick 10 best blogs
  Selects big, well known blogs (Instapundit, etc.)
  These contain many posts, take long to read!

[Plot: cascades captured vs. Cost(A) = number of posts / day.
 Cost/benefit analysis captures far more per unit cost than ignoring cost.]

Cost-benefit optimization picks summarizer blogs!
                                                                          123
Predicting the “hot” blogs
  Want blogs that will be informative in the future
  Split data set; train on historic, test on future

[Plot: cascades captured vs. Cost(A) = number of posts / day.
 “Cheating” (greedy on future, test on future) does well;
 greedy on historic, test on future does much worse.]

[Plot: #detections per month (Jan-May) for the greedy selection:
 it detects well on the training period, poorly afterwards.]

Blog selection “overfits” to training data!
Poor generalization! Why’s that?
Want blogs that continue to do well!
Let’s see what goes wrong here.
                                                                          124
Robust optimization

Fi(A) = detections in interval i

“Overfit” blog selection A:
[Plot: #detections per month (Jan-May) under greedy;
 F1(A)=.5, F2(A)=.8, F3(A)=.6, F4(A)=.01, F5(A)=.02]

Optimize worst-case:

    A* = argmaxA mini Fi(A)

“Robust” blog selection A*:
[Plot: detections using SATURATE are spread evenly across Jan-May.]

Robust optimization ⇔ Regularization!
                                                                          125
Predicting the “hot” blogs

[Plot: cascades captured vs. Cost(A) = number of posts / day.
 “Cheating” (greedy on future, test on future) is best; the robust solution
 tested on future is close behind; greedy on historic, test on future is
 worst.]

    50% better generalization!
                                                                          126
Other aspects: Complex constraints
   maxA F(A) or maxA mini Fi(A) subject to
      So far: |A| ≤ k
      In practice, more complex constraints:
      Different costs: C(A) ≤ B
      Locations need to be connected by paths (lake monitoring)
          [Chekuri & Pal, FOCS ’05] [Singh et al, IJCAI ’07]
      Sensors need to communicate, forming a routing tree (building monitoring)
                                                                          127
Naïve approach: Greedy-connect

  Simple heuristic: Greedily optimize submodular utility function F(A)
  Then add nodes to minimize communication cost C(A)

[Figure: building floorplan with candidate sensor locations, contrasting
 placements: the most informative node gives F(A) = 4 at communication cost
 C(A) = 10 (very informative, but high communication cost!); a cheap nearby
 placement gives C(A) = 1 but only F(A) = 0.2 (efficient communication, but
 not very informative); the second most informative node allows no
 communication unless a relay node is added (F(A) = 3.5, C(A) = 3.5).]

Communication cost = Expected # of trials
  (learned using Gaussian Processes)
Want to find optimal tradeoff between information and communication cost
                                                                          128
The pSPIEL Algorithm
         [Krause, Guestrin, Gupta, Kleinberg IPSN 2006]
    pSPIEL: Efficient nonmyopic algorithm
    (padded Sensor Placements at Informative and cost-
    Effective Locations)

[Figure: sensing region decomposed into clusters C1-C4 with inter-cluster
 edge costs.]

    Decompose sensing region into small, well-separated clusters
    Solve cardinality constrained problem per cluster (greedy)
    Combine solutions using k-MST algorithm
                                                                          129
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides
Submodularity slides

Contenu connexe

Tendances

Tendances (20)

A developers guide to machine learning
A developers guide to machine learningA developers guide to machine learning
A developers guide to machine learning
 
Desicion tree and neural networks
Desicion tree and neural networksDesicion tree and neural networks
Desicion tree and neural networks
 
ベルヌーイ分布における超パラメータ推定のための経験ベイズ法
ベルヌーイ分布における超パラメータ推定のための経験ベイズ法ベルヌーイ分布における超パラメータ推定のための経験ベイズ法
ベルヌーイ分布における超パラメータ推定のための経験ベイズ法
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
 
Optimization as a model for few shot learning
Optimization as a model for few shot learningOptimization as a model for few shot learning
Optimization as a model for few shot learning
 
Deep Generative Models
Deep Generative ModelsDeep Generative Models
Deep Generative Models
 
십분딥러닝_17_DIM(Deep InfoMax)
십분딥러닝_17_DIM(Deep InfoMax)십분딥러닝_17_DIM(Deep InfoMax)
십분딥러닝_17_DIM(Deep InfoMax)
 
機械学習 / Deep Learning 大全 (2) Deep Learning 基礎編
機械学習 / Deep Learning 大全 (2) Deep Learning 基礎編機械学習 / Deep Learning 大全 (2) Deep Learning 基礎編
機械学習 / Deep Learning 大全 (2) Deep Learning 基礎編
 
Chapter2.3.6
Chapter2.3.6Chapter2.3.6
Chapter2.3.6
 
【DL輪読会】Universal Trading for Order Execution with Oracle Policy Distillation
【DL輪読会】Universal Trading for Order Execution with Oracle Policy Distillation【DL輪読会】Universal Trading for Order Execution with Oracle Policy Distillation
【DL輪読会】Universal Trading for Order Execution with Oracle Policy Distillation
 
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Idea Behind Recurre...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Idea Behind Recurre...Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Idea Behind Recurre...
Deep Learning A-Z™: Recurrent Neural Networks (RNN) - The Idea Behind Recurre...
 
深層強化学習でマルチエージェント学習(後篇)
深層強化学習でマルチエージェント学習(後篇)深層強化学習でマルチエージェント学習(後篇)
深層強化学習でマルチエージェント学習(後篇)
 
Hyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement LearningHyperbolic Deep Reinforcement Learning
Hyperbolic Deep Reinforcement Learning
 
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
Colin Barre-Brisebois - GDC 2011 - Approximating Translucency for a Fast, Che...
 
ゼロから作るDeepLearning 2~3章 輪読
ゼロから作るDeepLearning 2~3章 輪読ゼロから作るDeepLearning 2~3章 輪読
ゼロから作るDeepLearning 2~3章 輪読
 
Precomputed atmospheric scattering(사전 계산 대기 산란)
Precomputed atmospheric scattering(사전 계산 대기 산란)Precomputed atmospheric scattering(사전 계산 대기 산란)
Precomputed atmospheric scattering(사전 계산 대기 산란)
 
Decision Trees
Decision TreesDecision Trees
Decision Trees
 
유니티의 라이팅이 안 이쁘다구요? (A to Z of Lighting)
유니티의 라이팅이 안 이쁘다구요? (A to Z of Lighting)유니티의 라이팅이 안 이쁘다구요? (A to Z of Lighting)
유니티의 라이팅이 안 이쁘다구요? (A to Z of Lighting)
 
Deep Learning Theory Seminar (Chap 1-2, part 1)
Deep Learning Theory Seminar (Chap 1-2, part 1)Deep Learning Theory Seminar (Chap 1-2, part 1)
Deep Learning Theory Seminar (Chap 1-2, part 1)
 
論文紹介:Dueling network architectures for deep reinforcement learning
論文紹介:Dueling network architectures for deep reinforcement learning論文紹介:Dueling network architectures for deep reinforcement learning
論文紹介:Dueling network architectures for deep reinforcement learning
 

Similaire à Submodularity slides

Functions for Grade 10
Functions for Grade 10Functions for Grade 10
Functions for Grade 10
Boipelo Radebe
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputs
zukun
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2
zukun
 
Object Detection with Discrmininatively Trained Part based Models
Object Detection with Discrmininatively Trained Part based ModelsObject Detection with Discrmininatively Trained Part based Models
Object Detection with Discrmininatively Trained Part based Models
zukun
 

Similaire à Submodularity slides (20)

Functions for Grade 10
Functions for Grade 10Functions for Grade 10
Functions for Grade 10
 
Kernels and Support Vector Machines
Kernels and Support Vector  MachinesKernels and Support Vector  Machines
Kernels and Support Vector Machines
 
ijcai09submodularity.ppt
ijcai09submodularity.pptijcai09submodularity.ppt
ijcai09submodularity.ppt
 
Machine learning of structured outputs
Machine learning of structured outputsMachine learning of structured outputs
Machine learning of structured outputs
 
Functions.ppt
Functions.pptFunctions.ppt
Functions.ppt
 
Functions.ppt
Functions.pptFunctions.ppt
Functions.ppt
 
UMAP - Mathematics and implementational details
UMAP - Mathematics and implementational detailsUMAP - Mathematics and implementational details
UMAP - Mathematics and implementational details
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Fp in scala part 2
Fp in scala part 2Fp in scala part 2
Fp in scala part 2
 
Good functional programming is good programming
Good functional programming is good programmingGood functional programming is good programming
Good functional programming is good programming
 
CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2CVPR2010: higher order models in computer vision: Part 1, 2
CVPR2010: higher order models in computer vision: Part 1, 2
 
Object Detection with Discrmininatively Trained Part based Models
Object Detection with Discrmininatively Trained Part based ModelsObject Detection with Discrmininatively Trained Part based Models
Object Detection with Discrmininatively Trained Part based Models
 
Functionsandpigeonholeprinciple
FunctionsandpigeonholeprincipleFunctionsandpigeonholeprinciple
Functionsandpigeonholeprinciple
 
gfg
gfggfg
gfg
 
Truth, deduction, computation lecture f
Truth, deduction, computation   lecture fTruth, deduction, computation   lecture f
Truth, deduction, computation lecture f
 
Functions (1)
Functions (1)Functions (1)
Functions (1)
 
5.3 Basic functions. Dynamic slides.
5.3 Basic functions. Dynamic slides.5.3 Basic functions. Dynamic slides.
5.3 Basic functions. Dynamic slides.
 
Functions (Theory)
Functions (Theory)Functions (Theory)
Functions (Theory)
 
237654933 mathematics-t-form-6
237654933 mathematics-t-form-6237654933 mathematics-t-form-6
237654933 mathematics-t-form-6
 
Exponents)
Exponents)Exponents)
Exponents)
 

Dernier

Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
PECB
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
kauryashika82
 

Dernier (20)

Sociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning ExhibitSociology 101 Demonstration of Learning Exhibit
Sociology 101 Demonstration of Learning Exhibit
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
psychiatric nursing HISTORY COLLECTION .docx
psychiatric  nursing HISTORY  COLLECTION  .docxpsychiatric  nursing HISTORY  COLLECTION  .docx
psychiatric nursing HISTORY COLLECTION .docx
 
Class 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdfClass 11th Physics NEET formula sheet pdf
Class 11th Physics NEET formula sheet pdf
 
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptxINDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
INDIA QUIZ 2024 RLAC DELHI UNIVERSITY.pptx
 
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural ResourcesEnergy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
Energy Resources. ( B. Pharmacy, 1st Year, Sem-II) Natural Resources
 
Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17Advanced Views - Calendar View in Odoo 17
Advanced Views - Calendar View in Odoo 17
 
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
Explore beautiful and ugly buildings. Mathematics helps us create beautiful d...
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptxBasic Civil Engineering first year Notes- Chapter 4 Building.pptx
Basic Civil Engineering first year Notes- Chapter 4 Building.pptx
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
2024-NATIONAL-LEARNING-CAMP-AND-OTHER.pptx
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in DelhiRussian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
Russian Escort Service in Delhi 11k Hotel Foreigner Russian Call Girls in Delhi
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 

Submodularity slides

  • 1. Beyond Convexity – Submodularity in Machine Learning Andreas Krause, Carlos Guestrin Carnegie Mellon University International Conference on Machine Learning | July 5, 2008 Carnegie Mellon
  • 2. Acknowledgements Thanks for slides and material to Mukund Narasimhan, Jure Leskovec and Manuel Reyes Gomez MATLAB Toolbox and details for references available at http://www.submodularity.org Algorithms implemented M 2
  • 3. Optimization in Machine Learning + Classify + from – by w1 finding a separating + + - - hyperplane (parameters w) + + - - w* - - Which one should we choose? w2 Define loss L(w) = “1/size of margin”  Solve for best vector w* = argminw L(w) L(w) Key observation: Many problems in ML are convex! w  no local minima!!  3
  • 4. Feature selection Given random variables Y, X1, … Xn Want to predict Y from subset XA = (Xi1,…,Xik) Y “Sick” Naïve Bayes X1 X2 X3 Model “Fever” “Rash” “Male” Want k most informative features: A* = argmax IG(XA; Y) s.t. |A| · k Uncertainty Uncertainty before knowing XA after knowing XA where IG(XA; Y) = H(Y) - H(Y | XA) 4
  • 5. Factoring distributions V Given random variables X1,…,Xn X1 X3 X2 X4 X6 X5 X7 Partition variables V into sets A and VnA as independent as possible X1 X3 X2 X7 X4 X6 X5 Formally: Want A VnA A* = argminA I(XA; XVnA) s.t. 0<|A|<n where I(XA,XB) = H(XB) - H(XB j XA) Fundamental building block in structure learning [Narasimhan&Bilmes, UAI ’04] 5
  • 6. Combinatorial problems in ML Given a (finite) set V, function F: 2V ! R, want A* = argmin F(A) s.t. some constraints on A Solving combinatorial problems: Mixed integer programming? Often difficult to scale to large problems Relaxations? (e.g., L1 regularization, etc.) Not clear when they work This talk: Fully combinatorial algorithms (spanning tree, matching, …) Exploit problem structure to get guarantees about solution! 6
  • 7. Example: Greedy algorithm for feature selection Given: finite set V of features, utility function F(A) = IG(XA; Y) Want: A*µ V such that Y “Sick” NP-hard! X1 X2 X3 Greedy algorithm: M “Fever” “Rash” “Male” Start with A = ; For i = 1 to k s* := argmaxs F(A [ {s}) A := A [ {s*} How well can this simple heuristic do? 7
  • 8. Key property: Diminishing returns Selection A = {} Selection B = {X2,X3} Y Y “Sick” “Sick” X2 X3 “Rash” “Male” Adding X1 X1 Adding X1 Theorem [Krause, Guestrin UAI ‘05]: Information gain F(A) in “Fever” will help a lot! doesn’t help much Naïve Bayes models is submodular! New feature X1 Submodularity: B A + s Large improvement + s Small improvement For Aµ B, F(A [ {s}) – F(A) ¸ F(B [ {s}) – F(B) 8
  • 9. Why is submodularity useful? Theorem [Nemhauser et al ‘78] Greedy maximization algorithm returns Agreedy: F(Agreedy) ¸(1-1/e) max|A|·k F(A) ~63% Greedy algorithm gives near-optimal solution! More details and exact statement later For info-gain: Guarantees best possible unless P = NP! [Krause, Guestrin UAI ’05] 9
  • 10. Submodularity in Machine Learning In this tutorial we will see that many ML problems are submodular, i.e., for F submodular require: Minimization: A* = argmin F(A) Structure learning (A* = argmin I(XA; XVnA)) Clustering MAP inference in Markov Random Fields … Maximization: A* = argmax F(A) Feature selection Active learning Ranking … 10
  • 11. Tutorial Overview  Examples and properties of submodular functions  Submodularity and convexity  Minimizing submodular functions  Maximizing submodular functions  Research directions, … LOTS of applications to Machine Learning!! 11
  • 13. Set functions Finite set V = {1,2,…,n} Function F: 2V ! R Will always assume F(;) = 0 (w.l.o.g.) Assume black-box that can evaluate F for any input A Approximate (noisy) evaluation of F is ok (e.g., [37]) Example: F(A) = IG(XA; Y) = H(Y) – H(Y | XA) = ∑y,x P(xA) [log P(y | xA) – log P(y)] A Y Y “Sick” “Sick” X1 X2 X2 X3 “Fever” “Rash” “Rash” “Male” F({X1,X2}) = 0.9 F({X2,X3}) = 0.5 13
  • 14. Submodular set functions Set function F on V is called submodular if For all A,B µ V: F(A)+F(B) ¸ F(A[B)+F(AÅB) + ¸ + A A [ BB AÅB Equivalent diminishing returns characterization: Submodularity: B A + S Large improvement + S Small improvement For AµB, s∉B, F(A [ {s}) – F(A) ¸ F(B [ {s}) – F(B) 14
  • 15. Submodularity and supermodularity Set function F on V is called submodular if 1) For all A,B µ V: F(A)+F(B) ¸ F(A[B)+F(AÅB)  2) For all AµB, s∉B, F(A [ {s}) – F(A) ¸ F(B [ {s}) – F(B) F is called supermodular if –F is submodular F is called modular if F is both sub- and supermodular for modular (“additive”) F, F(A) = ∑i2A w(i) 15
  • 16. Example: Set cover Place sensors Want to cover floorplan with discs in building Possible OFFICE OFFICE QUIET PHONE locations CONFERENCE STORAGE V LAB ELEC COPY SERVER KITCHEN Node predicts For A µ V: F(A) = “area values of positions covered by sensors placed at A” with some radius Formally: W finite set, collection of n subsets Si µ W For A µV={1,…,n} define F(A) = |i2 A Si| 16
  • 17. Set cover is submodular A={S1,S2} OFFICE OFFICE QUIET PHONE CONFERENCE S1 S2 STORAGE LAB ELEC COPY SERVER S’ KITCHEN F(A[{S’})-F(A) OFFICE OFFICE QUIET PHONE ¸ CONFERENCE F(B[{S’})-F(B) S1 S2 STORAGE LAB ELEC COPY S3 SERVER S4 KITCHEN S’ B = {S1,S2,S3,S4} 17
  • 18. Example: Mutual information Given random variables X1,…,Xn F(A) = I(XA; XVnA) = H(XVnA) – H(XVnA |XA) Lemma: Mutual information F(A) is submodular F(A [ {s}) – F(A) = H(Xsj XA) – H(Xsj XVn(A[{s}) ) Nonincreasing in A: Nondecreasing in A AµB ) H(Xs|XA) ¸ H(Xs|XB) δ s(A) = F(A[{s})-F(A) monotonically nonincreasing  F submodular  18
  • 19. Example: Influence in social networks [Kempe, Kleinberg, Tardos KDD ’03] Dorothy Eric Alice 0.2 Prob. of influencing 0.5 0.4 0.3 0.2 0.5 Bob 0.5 Fiona Charlie Who should get free cell phones? V = {Alice,Bob,Charlie,Dorothy,Eric,Fiona} F(A) = Expected number of people influenced when targeting A 19
  • 20. Influence in social networks is submodular [Kempe, Kleinberg, Tardos KDD ’03] Dorothy Eric Alice 0.2 0.5 0.4 0.3 0.2 0.5 Bob 0.5 Fiona Charlie Key idea: Flip coins c in advance  “live” edges Fc(A) = People influenced under outcome c (set cover!) F(A) = ∑ P(c) F (A) is submodular as well! 20
• 21. Closedness properties F1,…,Fm submodular functions on V and λ1,…,λm > 0 Then: F(A) = ∑i λi Fi(A) is submodular! Submodularity closed under nonnegative linear combinations! Extremely useful fact!! Fθ(A) submodular ⇒ ∑θ P(θ) Fθ(A) submodular! Multicriterion optimization: F1,…,Fm submodular, λi ≥ 0 ⇒ ∑i λi Fi(A) submodular 21
• 22. Submodularity and Concavity Suppose g: ℕ → ℝ and F(A) = g(|A|) Then F(A) submodular if and only if g concave! E.g., g could say “buying in bulk is cheaper” 22
• 23. Maximum of submodular functions Suppose F1(A) and F2(A) submodular. Is F(A) = max(F1(A), F2(A)) submodular? max(F1,F2) not submodular in general! 23
• 24. Minimum of submodular functions Well, maybe F(A) = min(F1(A), F2(A)) instead?
    A      F1(A)  F2(A)  F(A)
    ∅      0      0      0
    {a}    1      0      0
    {b}    0      1      0
    {a,b}  1      1      1
F({b}) − F(∅) = 0 < F({a,b}) − F({a}) = 1, so min(F1,F2) not submodular in general! But stay tuned: we’ll address mini Fi later! 24
• 25. Duality For F submodular on V let G(A) = F(V) − F(V\A) G is supermodular and called dual to F Details about properties in [Fujishige ’91] 25
  • 26. Tutorial Overview Examples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, max Submodularity and convexity Minimizing submodular functions Maximizing submodular functions Extensions and research directions 26
  • 27. Submodularity and Convexity Carnegie Mellon
• 28. Submodularity and convexity For V = {1,…,n} and A ⊆ V, let wA = (w1A,…,wnA) with wiA = 1 if i ∈ A, 0 otherwise Key result [Lovász ’83]: every submodular function F induces a function g on ℝ^n₊ such that F(A) = g(wA) for all A ⊆ V g(w) is convex minA F(A) = minw g(w) s.t. w ∈ [0,1]^n Let’s see how one can define g(w) 28
• 29. The submodular polyhedron PF PF = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V}, where x(A) = ∑_{i∈A} xi Example: V = {a,b}
    A      F(A)
    ∅      0
    {a}    −1
    {b}    2
    {a,b}  0
PF is cut out by the constraints x({a}) ≤ F({a}), x({b}) ≤ F({b}), x({a,b}) ≤ F({a,b}) 29
• 30. Lovász extension Claim: g(w) = max_{x∈PF} wᵀx, where PF = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V} Writing xw = argmax_{x∈PF} wᵀx, we have g(w) = wᵀxw Evaluating g(w) requires solving a linear program with exponentially many constraints  30
• 31. Evaluating the Lovász extension g(w) = max_{x∈PF} wᵀx, PF = {x ∈ ℝ^n : x(A) ≤ F(A) for all A ⊆ V} Theorem [Edmonds ’71, Lovász ‘83]: for any given w, can get optimal solution xw to the LP using the following greedy algorithm: Order V = {e1,…,en} so that w(e1) ≥ … ≥ w(en) Let xw(ei) = F({e1,…,ei}) − F({e1,…,ei−1}) Then wᵀxw = g(w) = max_{x∈PF} wᵀx Sanity check: if w = wA and A = {e1,…,ek}, then wᵀxw = ∑_{i≤k} xw(ei) = F(A) 31
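A minimal sketch of this greedy evaluation in Python, assuming F is given as a lookup table (values from the PF slide: F(∅)=0, F({a})=−1, F({b})=2, F({a,b})=0):

    Fvals = {frozenset(): 0, frozenset("a"): -1,
             frozenset("b"): 2, frozenset("ab"): 0}
    F = lambda A: Fvals[frozenset(A)]

    def lovasz(w, V=("a", "b")):
        order = sorted(V, key=lambda e: -w[e])        # w(e1) >= ... >= w(en)
        g, prefix = 0.0, set()
        for e in order:
            x_e = F(prefix | {e}) - F(prefix)         # greedy coordinate x_w(e_i)
            g += w[e] * x_e
            prefix.add(e)
        return g

    print(lovasz({"a": 0, "b": 1}))   # 2.0 = F({b})
    print(lovasz({"a": 1, "b": 1}))   # 0.0 = F({a,b})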
• 32. Example: Lovász extension g(w) = max {wᵀx : x ∈ PF}
    A      F(A)
    ∅      0
    {a}    −1
    {b}    2
    {a,b}  0
Want g(w) for w = [0,1] Greedy ordering: e1 = b, e2 = a since w(e1) = 1 > w(e2) = 0 xw(e1) = F({b}) − F(∅) = 2, xw(e2) = F({b,a}) − F({b}) = −2, so xw = [−2,2] g([0,1]) = [0,1]ᵀ[−2,2] = 2 = F({b}) Similarly g([1,1]) = [1,1]ᵀ[−1,1] = 0 = F({a,b}) 32
• 33. Why is this useful? Theorem [Lovász ’83]: g(w) attains its minimum in [0,1]^n at a corner! If we can minimize g on [0,1]^n, can minimize F… (at corners, g and F take same values) F(A) submodular ⇒ g(w) convex (and efficient to evaluate) Does the converse also hold, i.e., is any F with a convex extension submodular? No: consider g(w1,w2,w3) = max(w1, w2+w3), which is convex, yet the induced F on {a,b,c} has F({a,b}) − F({a}) = 0 < F({a,b,c}) − F({a,c}) = 1 33
  • 34. Tutorial Overview Examples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, max Submodularity and convexity Every SF induces a convex function with SAME minimum Special properties: Greedy solves LP over exponential polytope Minimizing submodular functions Maximizing submodular functions Extensions and research directions 34
  • 36. Overview minimization Minimizing general submodular functions Minimizing symmetric submodular functions Applications to Machine Learning 36
• 37. Minimizing a submodular function Want to solve A* = argminA F(A) Need to solve min_{w∈[0,1]^n} g(w) = minw maxx wᵀx s.t. w ∈ [0,1]^n, x ∈ PF Equivalently: min_{c,w} c s.t. c ≥ wᵀx for all x ∈ PF, w ∈ [0,1]^n 37
• 38. Ellipsoid algorithm [Grötschel, Lovasz, Schrijver ’81] min_{c,w} c s.t. c ≥ wᵀx for all x ∈ PF, w ∈ [0,1]^n Separation oracle: find most violated constraint: maxx wᵀx − c s.t. x ∈ PF Can solve separation using the greedy algorithm!! ⇒ Ellipsoid algorithm minimizes SFs in poly-time! 38
• 39. Minimizing submodular functions Ellipsoid algorithm not very practical Want combinatorial algorithm for minimization! Theorem [Iwata (2001)]: there is a fully combinatorial, strongly polynomial algorithm for minimizing SFs that runs in time O(n^8 log² n) Polynomial-time = practical??? 39
• 40. A more practical alternative? [Fujishige ’91, Fujishige et al ‘06]
    A      F(A)
    ∅      0
    {a}    −1
    {b}    2
    {a,b}  0
Base polytope: BF = PF ∩ {x : x(V) = F(V)} Minimum norm algorithm: 1. Find x* = argmin ‖x‖₂ s.t. x ∈ BF (here x* = [−1,1]) 2. Return A* = {i : x*(i) < 0} (here A* = {a}) Theorem [Fujishige ’91]: A* is an optimal solution! Note: can solve step 1 using Wolfe’s algorithm Runtime finite but unknown!!  40
• 41. Empirical comparison [Fujishige et al ’06] On cut functions from the DIMACS Challenge (problem sizes 64–1024, running time on a log-scale), the minimum norm algorithm is orders of magnitude faster than the alternatives Our implementation can solve n = 10k in < 6 minutes! 41
• 42. Checking optimality (duality)
    A      F(A)
    ∅      0
    {a}    −1
    {b}    2
    {a,b}  0
Base polytope: BF = PF ∩ {x : x(V) = F(V)} Theorem [Edmonds ’70]: minA F(A) = maxx {x⁻(V) : x ∈ BF}, where x⁻(s) = min {x(s), 0} Testing how close A’ is to minA F(A): run the greedy algorithm for w = wA’ to get xw; then F(A’) ≥ minA F(A) ≥ xw⁻(V) Example: A = {a}, F(A) = −1, w = [1,0], xw = [−1,1], xw⁻ = [−1,0], xw⁻(V) = −1 ⇒ A optimal! 42
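This certificate is cheap to compute; a sketch for the same toy F (greedy on w = wA' yields a point xw in BF, and the sum of its negative entries lower-bounds minA F(A)):

    Fvals = {frozenset(): 0, frozenset("a"): -1,
             frozenset("b"): 2, frozenset("ab"): 0}
    F = lambda A: Fvals[frozenset(A)]
    V = ("a", "b")

    def certificate(A_cand):
        w = {e: 1 if e in A_cand else 0 for e in V}
        order = sorted(V, key=lambda e: -w[e])
        x, prefix = {}, set()
        for e in order:
            x[e] = F(prefix | {e}) - F(prefix)
            prefix.add(e)
        lower = sum(min(x[e], 0) for e in V)          # x_w^-(V)
        return F(A_cand), lower                       # F(A') >= min_A F(A) >= lower

    print(certificate({"a"}))   # (-1, -1): bounds match, so {a} is optimal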
• 43. Overview minimization Minimizing general submodular functions Can minimize in polytime using ellipsoid method Combinatorial, strongly polynomial algorithm: O(n^8) Practical alternative: minimum norm algorithm? Minimizing symmetric submodular functions Applications to Machine Learning 43
• 44. What if we have special structure? Worst-case complexity of best known algorithm: O(n^8 log² n) Can we do better for special cases? Example (again): given RVs X1,…,Xn, F(A) = I(XA; XV\A) = I(XV\A; XA) = F(V\A) Functions F with F(A) = F(V\A) for all A are symmetric 44
• 45. Another example: Cut functions V = {a,b,c,d,e,f,g,h}, edges with weights ws,t F(A) = ∑ {ws,t : s ∈ A, t ∈ V\A} Example: F({a}) = 6; F({c,d}) = 10; F({a,b,c,d}) = 2 Cut function is symmetric and submodular! 45
• 46. Minimizing symmetric functions For any A, symmetry and submodularity imply 2F(A) = F(A) + F(V\A) ≥ F(A ∩ (V\A)) + F(A ∪ (V\A)) = F(∅) + F(V) = 2F(∅) = 0 Hence, any symmetric SF attains its minimum at ∅ In practice, want a nontrivial partition of V into A and V\A, i.e., require that A is neither ∅ nor V Want A* = argmin F(A) s.t. 0 < |A| < n There is an efficient algorithm for doing that!  46
• 47. Queyranne’s algorithm (overview) [Queyranne ’98] Theorem: there is a fully combinatorial, strongly polynomial algorithm for solving A* = argminA F(A) s.t. 0 < |A| < n for symmetric submodular functions Runs in time O(n³) [instead of O(n^8)…] Note: also works for “posimodular” functions: F posimodular ⇔ for all A,B ⊆ V: F(A) + F(B) ≥ F(A\B) + F(B\A) 47
• 48. Gomory Hu trees A tree T is called Gomory-Hu (GH) tree for SF F if for any s,t ∈ V it holds that min {F(A) : s ∈ A, t ∉ A} = min {wi,j : (i,j) is an edge on the s-t path in T} “min s-t cut in T = min s-t cut in G” Theorem [Queyranne ‘93]: GH-trees exist for any symmetric SF F! But expensive to find one in general!  48
• 49. Pendent pairs For function F on V, s,t ∈ V: (s,t) is a pendent pair if {s} ∈ argminA F(A) s.t. s ∈ A, t ∉ A Pendent pairs always exist: take any leaf s of a Gomory-Hu tree and its neighbor t, then (s,t) is pendent! E.g., (a,c), (b,c), (f,e), … Theorem [Queyranne ’95]: can find pendent pairs in O(n²) (without needing a GH-tree!) 49
• 50. Why are pendent pairs useful? Key idea: let (s,t) be pendent, A* = argmin F(A) Then EITHER s and t are separated by A* (e.g., s ∈ A*, t ∉ A*), but then A* = {s}!! OR s and t are not separated by A*, and then we can merge s and t… 50
• 51. Merging Suppose F is a symmetric SF on V, and we want to merge pendent pair (s,t) Key idea: “if we pick s, get t for free” V’ = V\{t} F’(A) = F(A ∪ {t}) if s ∈ A, F’(A) = F(A) if s ∉ A Lemma: F’ is still symmetric and submodular! (In the cut-function example, merging (a,c) contracts nodes a and c into one node and adds up the parallel edge weights) 51
• 52. Queyranne’s algorithm Input: symmetric SF F on V, |V| = n Output: A* = argmin F(A) s.t. 0 < |A| < n
    Initialize F’ ← F and V’ ← V
    For i = 1:n−1
        (s,t) ← pendentPair(F’,V’)
        Ai = {s}
        (F’,V’) ← merge(F’,V’,s,t)
    Return argmini F(Ai)
Running time: O(n³) function evaluations 52
• 53. Note: Finding pendent pairs
    Initialize v1 ← x (x is an arbitrary element of V)
    For i = 1 to n−1 do
        Wi ← {v1,…,vi}
        vi+1 ← argminv F(Wi ∪ {v}) − F({v}) s.t. v ∈ V\Wi
    Return pendent pair (vn−1, vn)
Requires O(n²) evaluations of F 53
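Putting slides 49–53 together, here is a compact Python sketch of the whole algorithm; merged elements are tracked as frozensets of original nodes, and the cut function and graph are our own toy example:

    edges = {("a", "b"): 2, ("a", "c"): 1, ("b", "d"): 3, ("c", "d"): 2}

    def cut(A):
        # weight of edges crossing between A and V \ A
        return sum(w for (s, t), w in edges.items() if (s in A) != (t in A))

    def pendent_pair(F, V):
        # Queyranne '95: (s, t) with {t} minimizing F among sets separating s, t
        W = [V[0]]
        for _ in range(len(V) - 1):
            rest = [v for v in V if v not in W]
            W.append(min(rest, key=lambda u: F(W + [u]) - F([u])))
        return W[-2], W[-1]

    def queyranne(F, nodes):
        V = [frozenset({v}) for v in sorted(nodes)]
        Func = lambda groups: F(set().union(*groups))
        best, best_val = None, float("inf")
        while len(V) > 1:
            s, t = pendent_pair(Func, V)
            if Func([t]) < best_val:                  # candidate A_i = {t}
                best, best_val = set(t), Func([t])
            V = [g for g in V if g not in (s, t)] + [s | t]   # merge (s, t)
        return best, best_val

    print(queyranne(cut, {"a", "b", "c", "d"}))   # ({'c'}, 3): a minimum cut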
• 54. Overview minimization Minimizing general submodular functions Can minimize in polytime using ellipsoid method Combinatorial, strongly polynomial algorithm: O(n^8) Practical alternative: minimum norm algorithm? Minimizing symmetric submodular functions Many useful submodular functions are symmetric Queyranne’s algorithm minimizes symmetric SFs in O(n³) Applications to Machine Learning 54
• 55. Application: Clustering [Narasimhan, Jojic, Bilmes NIPS ’05] Group data points V into “homogeneous clusters” Find a partition V = A1 ∪ … ∪ Ak that minimizes F(A1,…,Ak) = ∑i E(Ai), where E(Ai) is the “inhomogeneity of Ai” Examples for E(A): entropy H(A), cut function Special case: k = 2. Then F(A) = E(A) + E(V\A) is symmetric! If E is submodular, can use Queyranne’s algorithm!  55
• 56. What if we want k > 2 clusters? [Zhao et al ’05, Narasimhan et al ‘05] Greedy Splitting algorithm (a code sketch follows below):
    Start with partition P = {V}
    For i = 1 to k−1
        For each member Cj ∈ P do
            split cluster Cj: A* = argmin E(A) + E(Cj\A) s.t. 0 < |A| < |Cj|
            Pj ← P \ {Cj} ∪ {A, Cj\A} (partition we get by splitting the j-th cluster)
        P ← argminj F(Pj)
Theorem: F(P) ≤ (2 − 2/k) F(Popt) 56
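A brute-force Python sketch of greedy splitting; the inner 2-way split is exhaustive here (Queyranne's algorithm would replace it for large clusters), and E is an illustrative "inhomogeneity" of our own choosing, the diameter of a 1-D point set:

    from itertools import combinations

    def E(A):
        return max(A) - min(A) if A else 0

    def best_split(C):
        C = list(C)
        best = None
        for r in range(1, len(C)):
            for S in combinations(C, r):
                A = set(S); B = set(C) - A
                if best is None or E(A) + E(B) < best[0]:
                    best = (E(A) + E(B), A, B)
        return best

    def greedy_splitting(V, k):
        P = [set(V)]
        for _ in range(k - 1):
            # split the cluster giving the largest decrease in sum_j E(A_j)
            j, (_, A, B) = min(((i, best_split(C)) for i, C in enumerate(P)
                                if len(C) > 1),
                               key=lambda t: t[1][0] - E(P[t[0]]))
            P[j:j + 1] = [A, B]
        return P

    print(greedy_splitting({0, 1, 2, 10, 11, 20}, 3))  # [{0,1,2}, {10,11}, {20}]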
  • 57. Example: Clustering species [Narasimhan et al ‘05] Common genetic information = #of common substrings: Can easily extend to sets of species 57
  • 58. Example: Clustering species [Narasimhan et al ‘05] The common genetic information ICG does not require alignment captures genetic similarity is smallest for maximally evolutionarily diverged species is a symmetric submodular function!  Greedy splitting algorithm yields phylogenetic tree! 58
  • 59. Example: SNPs [Narasimhan et al ‘05] Study human genetic variation (for personalized medicine, …) Most human variation due to point mutations that occur once in human history at that base location: Single Nucleotide Polymorphisms (SNPs) Cataloging all variation too expensive ($10K-$100K per individual!!) 59
  • 60. SNPs in the ACE gene [Narasimhan et al ‘05] Rows: Individuals. Columns: SNPs. Which columns should we pick to reconstruct the rest? Can find near-optimal clustering (Queyranne’s algorithm) 60
• 61. Reconstruction accuracy [Narasimhan et al ‘05] Plot: prediction accuracy vs. # of clusters, comparing the submodular clustering against clustering based on entropy, pairwise correlation, and PCA 61
• 62. Example: Speaker segmentation [Reyes-Gomez, Jojic ‘07] Given a spectrogram (frequency × time) of mixed waveforms from speakers “Alice” and “Fiona”, partition the time-frequency plane into speaker regions E(A) = −log p(XA): likelihood of “region” A F(A) = E(A) + E(V\A) is symmetric & posimodular ⇒ can partition using Queyranne’s algorithm 62
• 64. Example: Image denoising Pairwise Markov Random Field over noisy pixels Xi and “true” pixels Yi P(x1,…,xn,y1,…,yn) = ∏i,j ψi,j(yi,yj) ∏i φi(xi,yi) Want argmaxy P(y | x) = argmaxy log P(x,y) = argminy ∑i,j Ei,j(yi,yj) + ∑i Ei(yi), where Ei,j(yi,yj) = −log ψi,j(yi,yj) When is this MAP inference efficiently solvable (in high treewidth graphical models)? 64
• 65. MAP inference in Markov Random Fields [Kolmogorov et al, PAMI ’04; see also: Hammer, Ops Res ‘65] Energy E(y) = ∑i,j Ei,j(yi,yj) + ∑i Ei(yi) Suppose yi are binary; define F(A) = E(yA) where yAi = 1 iff i ∈ A Then miny E(y) = minA F(A) Theorem: MAP inference problem solvable by graph cuts ⇔ for all i,j: Ei,j(0,0) + Ei,j(1,1) ≤ Ei,j(0,1) + Ei,j(1,0) ⇔ each Ei,j is submodular “Efficient if we prefer that neighboring pixels have the same color” 65
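The regularity condition is a one-line check per edge; a sketch with illustrative numbers:

    def graph_cut_solvable(E_ij):
        # E_ij maps (yi, yj) in {0,1}^2 to an energy value
        return E_ij[0, 0] + E_ij[1, 1] <= E_ij[0, 1] + E_ij[1, 0]

    potts = {(0, 0): 0.0, (1, 1): 0.0, (0, 1): 1.0, (1, 0): 1.0}  # prefers equal labels
    print(graph_cut_solvable(potts))   # True: MAP reduces to min-cut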
• 66. Constrained minimization Have seen: if F submodular on V, can solve A* = argmin F(A) s.t. A ⊆ V What about A* = argmin F(A) s.t. A ⊆ V and |A| ≤ k? E.g., clustering with minimum # points per cluster, … In general, not much known about constrained minimization  However, can do A* = argmin F(A) s.t. 0 < |A| < n A* = argmin F(A) s.t. |A| is odd/even [Goemans & Ramakrishnan ‘95] A* = argmin F(A) s.t. A ∈ argmin G(A) for G submodular [Fujishige ’91] 66
• 67. Overview minimization Minimizing general submodular functions Can minimize in polytime using ellipsoid method Combinatorial, strongly polynomial algorithm: O(n^8) Practical alternative: minimum norm algorithm? Minimizing symmetric submodular functions Many useful submodular functions are symmetric Queyranne’s algorithm minimizes symmetric SFs in O(n³) Applications to Machine Learning Clustering [Narasimhan et al’ 05] Speaker segmentation [Reyes-Gomez & Jojic ’07] MAP inference [Kolmogorov et al ’04] 67
• 68. Tutorial Overview Examples and properties of submodular functions Many problems submodular (mutual information, influence, …) SFs closed under positive linear combinations; not under min, max Submodularity and convexity Every SF induces a convex function with SAME minimum Special properties: Greedy solves LP over exponential polytope Minimizing submodular functions Minimization possible in polynomial time (but O(n^8)…) Queyranne’s algorithm minimizes symmetric SFs in O(n³) Useful for clustering, MAP inference, structure learning, … Maximizing submodular functions Extensions and research directions 68
  • 69. Maximizing submodular functions Carnegie Mellon
• 70. Maximizing submodular functions Minimizing convex functions: polynomial time solvable! Minimizing submodular functions: polynomial time solvable! Maximizing convex functions: NP hard! Maximizing submodular functions: NP hard! But can get approximation guarantees  70
• 71. Maximizing influence [Kempe, Kleinberg, Tardos KDD ’03] F(A) = expected # people influenced when targeting A F monotonic: if A ⊆ B then F(A) ≤ F(B) Hence V = argmaxA F(A) More interesting: argmaxA F(A) − Cost(A) 71
• 72. Maximizing non-monotonic functions Suppose we want A* = argmax F(A) s.t. A ⊆ V for F not monotonic (the maximum can be attained at an interior set size |A|) Example: F(A) = U(A) − C(A), where U(A) is a submodular utility and C(A) a supermodular cost function E.g.: trading off utility and privacy in personalized search [Krause & Horvitz AAAI ’08] In general: NP hard. Moreover: if F(A) can take negative values, it is as hard to approximate as maximum independent set 72
• 73. Maximizing positive submodular functions [Feige, Mirrokni, Vondrak FOCS ’07] Theorem: there is an efficient randomized local search procedure that, given a positive submodular function F with F(∅) = 0, returns a set ALS such that F(ALS) ≥ (2/5) maxA F(A) Picking a random set gives a ¼ approximation (½ approximation if F is symmetric!) We cannot get better than a ¾ approximation unless P = NP 73
• 74. Scalarization vs. constrained maximization Given monotonic utility F(A) and cost C(A), optimize: Option 1: maxA F(A) − C(A) s.t. A ⊆ V (“scalarization”) Option 2: maxA F(A) s.t. C(A) ≤ B (“constrained maximization”) For Option 1, can get 2/5 approx if F(A) − C(A) ≥ 0 for all A ⊆ V; positiveness is a strong requirement  Option 2: coming up… 74
• 75. Constrained maximization: Outline maxA F(A) s.t. C(A) ≤ B (monotonic submodular utility F, selected set A, selection cost C, budget B) Subset selection: C(A) = |A| Robust optimization Complex constraints 75
• 76. Monotonicity A set function is called monotonic if A ⊆ B ⊆ V ⇒ F(A) ≤ F(B) Examples: Influence in social networks [Kempe et al KDD ’03] For discrete RVs, entropy F(A) = H(XA) is monotonic: suppose B = A ∪ C; then F(B) = H(XA, XC) = H(XA) + H(XC | XA) ≥ H(XA) = F(A) Information gain: F(A) = H(Y) − H(Y | XA) Set cover Matroid rank functions (dimension of vector spaces, …) … 76
• 77. Subset selection Given: finite set V, monotonic submodular function F, F(∅) = 0 Want: A* = argmax F(A) s.t. |A| ≤ k NP-hard! 77
• 78. Exact maximization of monotonic submodular functions 1) Mixed integer programming [Nemhauser et al ’81]: max η s.t. η ≤ F(B) + ∑_{s∈V\B} αs δs(B) for all B ⊆ V, ∑s αs ≤ k, αs ∈ {0,1}, where δs(B) = F(B ∪ {s}) − F(B) Solved using constraint generation 2) Branch-and-bound: “data-correcting algorithm” [Goldengorin et al ’99] Both algorithms worst-case exponential! 78
• 79. Approximate maximization Given: finite set V, monotonic submodular function F(A) Want: A* = argmax F(A) s.t. |A| ≤ k (NP-hard!) Greedy algorithm (a Python sketch follows):
    Start with A0 = ∅
    For i = 1 to k
        si := argmaxs F(Ai−1 ∪ {s}) − F(Ai−1)
        Ai := Ai−1 ∪ {si}
79
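A generic sketch, where F is any monotonic submodular function given as a callable and the coverage instance is made up:

    def greedy(F, V, k):
        A = set()
        for _ in range(k):
            s = max(V - A, key=lambda e: F(A | {e}) - F(A))   # best marginal gain
            A.add(s)
        return A

    S = {1: {"u", "v"}, 2: {"v", "w"}, 3: {"x"}}
    F = lambda A: len(set().union(*(S[i] for i in A)))
    print(greedy(F, set(S), 2))   # covers 3 of the 4 elements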
• 80. Performance of greedy algorithm Theorem [Nemhauser et al ‘78]: given a monotonic submodular function F with F(∅) = 0, the greedy maximization algorithm returns Agreedy with F(Agreedy) ≥ (1 − 1/e) max_{|A|≤k} F(A), i.e., ~63% of optimal Sidenote: greedy algorithm gives a ½ approximation for maximization over any matroid C! [Fisher et al ’78] 80
• 81. An “elementary” counterexample X1, X2 ~ Bernoulli(0.5), Y = X1 XOR X2 Let F(A) = IG(XA; Y) = H(Y) − H(Y | XA) Y | X1 and Y | X2 ~ Bernoulli(0.5) (entropy 1), but Y | X1,X2 is deterministic! (entropy 0) Hence F({1,2}) − F({1}) = 1 > F({2}) − F(∅) = 0: not submodular! F(A) submodular under some conditions! (later) 81
• 82. Example: Submodularity of info-gain Y1,…,Ym, X1,…,Xn discrete RVs F(A) = IG(Y; XA) = H(Y) − H(Y | XA) F(A) is always monotonic However, NOT always submodular Theorem [Krause & Guestrin UAI’ 05]: if the Xi are all conditionally independent given Y, then F(A) is submodular! Hence, greedy algorithm works! In fact, NO algorithm can do better than a (1 − 1/e) approximation! 82
• 83. Building a Sensing Chair [Mutlu, Krause, Forlizzi, Guestrin, Hodgins UIST ‘07] People sit a lot Activity recognition in assistive technologies Seating pressure as user interface Equipped with 1 sensor per cm²: costs $16,000!  82% accuracy on 10 postures (lean left, lean forward, slouch, …) [Tan et al] Can we get similar accuracy with fewer, cheaper sensors? 83
• 84. How to place sensors on a chair? Sensor readings at locations V as random variables Predict posture Y using probabilistic model P(Y,V) Pick sensor locations A* ⊆ V to minimize entropy Placed sensors, did a user study: before: 82% accuracy at $16,000 ; after: 79% accuracy at $100  Similar accuracy at <1% of the cost! 84
• 85. Variance reduction (a.k.a. orthogonal matching pursuit, forward regression) Let Y = ∑i αi Xi + ε, with (X1,…,Xn,ε) ~ N(·; μ,Σ) Want to pick subset XA to predict Y Var(Y | XA = xA): conditional variance of Y given XA = xA Expected variance: Var(Y | XA) = ∫ p(xA) Var(Y | XA = xA) dxA Variance reduction: FV(A) = Var(Y) − Var(Y | XA) FV(A) is always monotonic Theorem [Das & Kempe, STOC ’08]: FV(A) is submodular* (*under some conditions on Σ) ⇒ orthogonal matching pursuit near optimal! [see other analyses by Tropp, Donoho et al., and Temlyakov] 85
• 86. Batch mode active learning [Hoi et al, ICML’06] Which data points should we label to minimize error? Want batch A of k points to show an expert for labeling F(A) selects examples that are uncertain [σ²(s) = π(s)(1 − π(s)) is large] diverse (points in A are as different as possible) relevant (as close to V\A as possible, sᵀs’ large) F(A) is submodular and monotonic! [approximation to improvement in Fisher information] 86
  • 87. Results about Active Learning [Hoi et al, ICML’06] Batch mode Active Learning performs better than Picking k points at random Picking k points of highest entropy 87
• 88. Monitoring water networks [Krause et al, J Wat Res Mgt 2008] Contamination of drinking water could affect millions of people Place sensors to detect contaminations (water flow simulator from EPA; Hach sensors at ~$14K each) “Battle of the Water Sensor Networks” competition Where should we place sensors to quickly detect contamination? 88
• 89. Model-based sensing Utility of placing sensors based on model of the world For water networks: water flow simulator from EPA F(A) = expected impact reduction from placing sensors at A (a sensor reduces impact through early detection; the model predicts low/medium/high impact per contamination location) E.g., high impact reduction F(A) = 0.9; low impact reduction F(A) = 0.01 Theorem [Krause et al., J Wat Res Mgt ’08]: impact reduction F(A) in water networks is submodular! 89
  • 90. Battle of the Water Sensor Networks Competition Real metropolitan area network (12,527 nodes) Water flow simulator provided by EPA 3.6 million contamination events Multiple objectives: Detection time, affected population, … Place sensors that detect well “on average” 90
• 91. Bounds on optimal solution [Krause et al., J Wat Res Mgt ’08] Plot (water networks data): population protected F(A) vs. number of sensors placed, for the greedy solution against the offline (Nemhauser) bound The (1 − 1/e) bound is quite loose… can we get better bounds? 91
• 92. Data dependent bounds [Minoux ’78] Suppose A is a candidate solution to argmax F(A) s.t. |A| ≤ k, and A* = {s1,…,sk} an optimal solution Then F(A*) ≤ F(A ∪ A*) = F(A) + ∑i [F(A ∪ {s1,…,si}) − F(A ∪ {s1,…,si−1})] ≤ F(A) + ∑i [F(A ∪ {si}) − F(A)] = F(A) + ∑i δsi For each s ∈ V\A, let δs = F(A ∪ {s}) − F(A); order so that δ1 ≥ δ2 ≥ … ≥ δn; then F(A*) ≤ F(A) + ∑_{i=1}^k δi 92
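In code, the bound is one sort away; a sketch on a toy coverage instance (names ours):

    def minoux_bound(F, V, A, k):
        # F(A*) <= F(A) + sum of the k largest marginal gains delta_s
        deltas = sorted((F(A | {s}) - F(A) for s in V - A), reverse=True)
        return F(A) + sum(deltas[:k])

    S = {1: {"u", "v"}, 2: {"v", "w"}, 3: {"x"}, 4: {"u"}}
    F = lambda A: len(set().union(*(S[i] for i in A)))
    print(F({1}), minoux_bound(F, set(S), {1}, k=2))   # 2, 4: so OPT_2 <= 4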
• 93. Bounds on optimal solution [Krause et al., J Wat Res Mgt ’08] Plot (water networks data): sensing quality F(A) vs. number of sensors placed; the data-dependent bound on the greedy solution is much tighter than the offline (Nemhauser) bound Submodularity gives data-dependent bounds on the performance of any algorithm 93
• 94. BWSN Competition results [Ostfeld et al., J Wat Res Mgt 2008] 13 participants; performance measured in 30 different criteria Entrants used genetic algorithms (G), domain knowledge (D), other heuristics (H), and “exact” methods / MIP (E) Total score: 24% better performance than runner-up!  94
• 95. What was the trick? Simulated all 3.6M contaminations on 2 weeks / 40 processors 152 GB data on disk, 16 GB in main memory (compressed) ⇒ very accurate computation of F(A), but very slow evaluation of F(A)  Exhaustive search (all subsets): 6 weeks for all 30 settings ; naive greedy: 30 hours / 20 sensors Submodularity to the rescue 95
• 96. Scaling up greedy algorithm [Minoux ’78] In round i+1, have picked Ai = {s1,…,si}; pick si+1 = argmaxs F(Ai ∪ {s}) − F(Ai) I.e., maximize “marginal benefit” δs(Ai) = F(Ai ∪ {s}) − F(Ai) Key observation: submodularity implies i ≤ j ⇒ δs(Ai) ≥ δs(Aj) Marginal benefits can never increase! 96
• 97. “Lazy” greedy algorithm [Minoux ’78] Lazy greedy algorithm: First iteration as usual Keep an ordered list of marginal benefits δi from the previous iteration Re-evaluate δi only for the top element If δi stays on top, use it; otherwise re-sort Note: very easy to compute online bounds, lazy evaluations, etc. [Leskovec et al. ’07] 97
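A heap-based sketch of lazy greedy; submodularity guarantees that a re-evaluated element still on top of the heap is the true argmax:

    import heapq

    def lazy_greedy(F, V, k):
        A, fA = set(), F(set())
        heap = [(-(F({s}) - fA), s) for s in V]    # negate: heapq is a min-heap
        heapq.heapify(heap)
        for _ in range(k):
            while True:
                _, s = heapq.heappop(heap)
                gain = F(A | {s}) - fA             # re-evaluate the top element
                if not heap or gain >= -heap[0][0] - 1e-12:
                    A.add(s); fA += gain           # still on top: take it
                    break
                heapq.heappush(heap, (-gain, s))   # stale: reinsert and retry
        return A

    S = {1: {"u", "v"}, 2: {"v", "w"}, 3: {"x"}}
    F = lambda A: len(set().union(*(S[i] for i in A)))
    print(lazy_greedy(F, set(S), 2))   # same solution as plain greedy, fewer F calls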
• 98. Result of lazy evaluation Simulated all 3.6M contaminations on 2 weeks / 40 processors 152 GB data on disk, 16 GB in main memory (compressed) ⇒ very accurate computation of F(A), but very slow evaluation of F(A)  Exhaustive search (all subsets): 6 weeks for all 30 settings; naive greedy: 30 hours / 20 sensors Fast greedy using “lazy evaluations”: 1 hour / 20 sensors; done after 2 days!  98
• 99. What about worst-case? [Krause et al., NIPS ’07] Knowing the sensor locations, an adversary contaminates here! A placement that detects well on “average-case” (accidental) contamination can have very different average-case impact but the same worst-case impact as another placement Where should we place sensors to quickly detect in the worst case? 99
• 100. Constrained maximization: Outline maxA F(A) s.t. C(A) ≤ B (utility function F, selected set A, selection cost C, budget B) Subset selection Robust optimization Complex constraints 100
• 101. Optimizing for the worst case Separate utility function Fi for each contamination i: Fi(A) = impact reduction by sensors A for contamination i E.g., for a contamination at node s, Fs(A) may be high while Fr(A) for a contamination at node r is low Want to solve A* = argmaxA mini Fi(A) Each of the Fi is submodular; unfortunately, mini Fi is not submodular! How can we solve this robust optimization problem? 101
• 102. How does the greedy algorithm do? V = {a, b, c}, can only buy k = 2
    Set A   F1   F2   mini Fi
    {a}     1    0    0
    {b}     0    2    0
    {c}     ε    ε    ε
    {a,c}   1    ε    ε
    {b,c}   ε    2    ε
    {a,b}   2    1    1
Greedy picks c first; then it can choose only {a,c} or {b,c} Optimal score: 1; greedy score: ε ⇒ greedy does arbitrarily badly. Is there something better? Theorem [NIPS ’07]: the problem max_{|A|≤k} mini Fi(A) does not admit any approximation unless P = NP 102
• 103. Alternative formulation If somebody told us the optimal value c, could we recover the optimal solution A*? Need to find A* with mini Fi(A*) ≥ c and |A*| ≤ k Is this any easier? Yes, if we relax the constraint |A| ≤ k 103
• 104. Solving the alternative problem Trick: for each Fi and c, define the truncation F’i,c(A) = min{Fi(A), c}, which remains submodular! Problem 1 (last slide): max_{|A|≤k} mini Fi(A): non-submodular , don’t know how to solve Problem 2: min |A| s.t. (1/m) ∑i F’i,c(A) ≥ c: submodular! For the right c, both problems have the same optimal solutions, so solving one solves the other But F appears as a constraint? 104
• 105. Maximization vs. coverage Previously: wanted A* = argmax F(A) s.t. |A| ≤ k Now need to solve: A* = argmin |A| s.t. F(A) ≥ Q Greedy algorithm:
    Start with A := ∅
    While F(A) < Q and |A| < n
        s* := argmaxs F(A ∪ {s})
        A := A ∪ {s*}
(For the bound, assume F is integral; if not, just round it.) Theorem [Wolsey et al]: greedy will return Agreedy with |Agreedy| ≤ (1 + log maxs F({s})) |Aopt| 105
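The coverage variant in Python (a sketch; Q is the required quality, and the instance is made up):

    def greedy_coverage(F, V, Q):
        A = set()
        while F(A) < Q and len(A) < len(V):
            A.add(max(V - A, key=lambda e: F(A | {e}) - F(A)))
        return A

    S = {1: {"u", "v"}, 2: {"v", "w"}, 3: {"x"}}
    F = lambda A: len(set().union(*(S[i] for i in A)))
    print(greedy_coverage(F, set(S), Q=4))   # {1, 2, 3}: all sets needed to cover 4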
• 106. Solving the alternative problem Trick: for each Fi and c, define the truncation F’i,c(A) = min{Fi(A), c} Problem 1 (last slide): non-submodular , don’t know how to solve Problem 2: submodular, so we can use the greedy (coverage) algorithm! 106
• 107. Back to our example Guess c = 1
    Set A   F1   F2   mini Fi   F’avg,1
    {a}     1    0    0         ½
    {b}     0    2    0         ½
    {c}     ε    ε    ε         ε
    {a,c}   1    ε    ε         (1+ε)/2
    {b,c}   ε    2    ε         (1+ε)/2
    {a,b}   2    1    1         1
Greedy first picks a, then picks b ⇒ optimal solution! How do we find c? Do binary search! 107
• 108. SATURATE Algorithm [Krause et al, NIPS ‘07] Given: set V, integer k and monotonic SFs F1,…,Fm
    Initialize cmin = 0, cmax = mini Fi(V)
    Do binary search: c = (cmin + cmax)/2
        Greedily find AG such that F’avg,c(AG) = c
        If |AG| ≤ αk: increase cmin
        If |AG| > αk: decrease cmax
    until convergence
108
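A compact sketch of SATURATE; α = 1 here purely for illustration (the guarantee on the next slide uses the larger α), and the two modular objectives are made up:

    def saturate(Fs, V, k, alpha=1.0, iters=30):
        best = set(V)
        cmin, cmax = 0.0, min(F(set(V)) for F in Fs)
        for _ in range(iters):
            c = (cmin + cmax) / 2
            Favg = lambda A: sum(min(F(A), c) for F in Fs) / len(Fs)  # truncated mean
            A = set()
            while Favg(A) < c - 1e-9 and len(A) < len(V):             # coverage greedy
                A.add(max(V - A, key=lambda e: Favg(A | {e}) - Favg(A)))
            if len(A) <= alpha * k:
                best, cmin = A, c        # feasible: try a larger c
            else:
                cmax = c                 # infeasible: lower c
        return best

    F1 = lambda A: sum({1: 1.0, 2: 0.0, 3: 0.5}[i] for i in A)
    F2 = lambda A: sum({1: 0.0, 2: 1.0, 3: 0.5}[i] for i in A)
    print(saturate([F1, F2], {1, 2, 3}, k=1))   # {3}: best single worst-case choice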
• 109. Theoretical guarantees [Krause et al, NIPS ‘07] Theorem: the problem max_{|A|≤k} mini Fi(A) does not admit any approximation unless P = NP  Theorem: SATURATE finds a solution AS such that mini Fi(AS) ≥ OPTk and |AS| ≤ αk, where OPTk = max_{|A|≤k} mini Fi(A) and α = 1 + log maxs ∑i Fi({s}) Theorem: if there were a polytime algorithm with better factor β < α, then NP ⊆ DTIME(n^{log log n}) 109
• 110. Example: Lake monitoring Monitor pH values using a robotic sensor transect Given observations A, predict pH at unobserved locations Use probabilistic model (Gaussian processes) to estimate the prediction error Var(s | A) at each position s along the transect Where should we sense to minimize our maximum error? Var(s | A) is (often) submodular [Das & Kempe ’08] ⇒ robust submodular optimization problem! 110
• 111. Comparison with state of the art Algorithm used in geostatistics: simulated annealing [Sacks & Schiller ’88, van Groeningen & Stein ’98, Wiens ’05, …], with 7 parameters that need to be fine-tuned Plots (maximum marginal variance vs. number of sensors, on environmental monitoring and precipitation data): SATURATE beats greedy and matches simulated annealing SATURATE is competitive & 10x faster, with no parameters to tune! 111
• 112. Results on water networks Plot (water networks data): maximum detection time (minutes) vs. number of sensors Greedy and simulated annealing show no decrease until all contaminations are detected; SATURATE achieves 60% lower worst-case detection time! 112
• 113. Worst- vs. average case Given: set V, submodular functions F1,…,Fm Average-case score Fac(A) = (1/m) ∑i Fi(A): too optimistic? Worst-case score Fwc(A) = mini Fi(A): very pessimistic! Want to optimize both average- and worst-case score! Want: Fac(A) ≥ cac and Fwc(A) ≥ cwc Truncate: min{Fac(A), cac} + min{Fwc(A), cwc} ≥ cac + cwc Can modify SATURATE to solve this problem!  113
• 114. Worst- vs. average case Plot (water networks data): worst-case impact vs. average-case impact (lower is better for both) Optimizing only for the average case or only for the worst case is poor on the other objective; SATURATE traces the tradeoff curve, which has a knee Can find a good compromise between average- and worst-case score! 114
• 115. Constrained maximization: Outline maxA F(A) s.t. C(A) ≤ B (utility function F, selected set A, selection cost C, budget B) Subset selection Robust optimization Complex constraints 115
• 116. Other aspects: Complex constraints maxA F(A) or maxA mini Fi(A) subject to: so far, |A| ≤ k In practice, more complex constraints: Different costs: C(A) ≤ B Locations need to be connected by paths [Chekuri & Pal, FOCS ’05] (lake monitoring) Sensors need to communicate (form a routing tree) [Singh et al, IJCAI ’07] (building monitoring) 116
• 117. Non-constant cost functions For each s ∈ V, let c(s) > 0 be its cost (e.g., feature acquisition costs, …) Cost of a set: C(A) = ∑_{s∈A} c(s) (a modular function!) Want to solve A* = argmax F(A) s.t. C(A) ≤ B Cost-benefit greedy algorithm:
    Start with A := ∅
    While there is an s ∈ V\A s.t. C(A ∪ {s}) ≤ B
        s* := argmaxs [F(A ∪ {s}) − F(A)] / c(s) over such affordable s
        A := A ∪ {s*}
117
• 118. Performance of cost-benefit greedy Want maxA F(A) s.t. C(A) ≤ 1
    Set A   F(A)   C(A)
    {a}     2ε     ε
    {b}     1      1
Cost-benefit greedy picks a (benefit/cost ratio 2 vs. 1). Then it cannot afford b!  Cost-benefit greedy performs arbitrarily badly! 118
• 119. Cost-benefit optimization [Wolsey ’82, Sviridenko ’04, Leskovec et al ’07] Theorem [Leskovec et al. KDD ‘07]: let ACB be the cost-benefit greedy solution and AUC the unit-cost greedy solution (i.e., ignore costs); then max {F(ACB), F(AUC)} ≥ ½ (1 − 1/e) OPT Can still compute online bounds and speed up using lazy evaluations Note: can also get a (1 − 1/e) approximation in time O(n⁴) [Sviridenko ’04], and slightly better than ½ (1 − 1/e) in O(n²) [Wolsey ‘82] 119
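A sketch of this max-of-two-greedys trick, run on the bad example from the previous slide with ε = 0.01:

    def budgeted_greedy(F, cost, V, B, use_ratio):
        A, spent = set(), 0.0
        while True:
            cand = [s for s in V - A if spent + cost[s] <= B]
            if not cand:
                return A
            s = max(cand, key=lambda e: (F(A | {e}) - F(A)) /
                                        (cost[e] if use_ratio else 1.0))
            A.add(s); spent += cost[s]

    def cef(F, cost, V, B):
        A_cb = budgeted_greedy(F, cost, V, B, use_ratio=True)    # cost-benefit
        A_uc = budgeted_greedy(F, cost, V, B, use_ratio=False)   # unit-cost
        return max((A_cb, A_uc), key=F)

    cost = {"a": 0.01, "b": 1.0}
    F = lambda A: 0.02 * ("a" in A) + 1.0 * ("b" in A)
    print(cef(F, cost, {"a", "b"}, B=1.0))   # {'b'}: unit-cost greedy saves the day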
  • 120. Example: Cascades in the Blogosphere [Leskovec, Krause, Guestrin, Faloutsos, VanBriesen, Glance ‘07] Time Learn about story after us! Information cascade Which blogs should we read to learn about big cascades early? 120
  • 121. Water vs. Web Placing sensors in Selecting water networks vs. informative blogs In both problems we are given Graph with nodes (junctions / blogs) and edges (pipes / links) Cascades spreading dynamically over the graph (contamination / citations) Want to pick nodes to detect big cascades early In both applications, utility functions submodular  [Generalizes Kempe et al, KDD ’03] 121
• 122. Performance on Blog selection Blog selection on ~45k blogs: cascades captured vs. number of blogs; greedy outperforms state-of-the-art heuristics (in-links, all outlinks, # posts, random) Running time: fast (lazy) greedy gives a 700x speedup over naive greedy using submodularity! (exhaustive search over all subsets is hopeless) 122
• 123. Cost of reading a blog Naïve approach: just pick the 10 best blogs Selects big, well-known blogs (Instapundit, etc.) These contain many posts, take long to read! Plot: cascades captured vs. Cost(A) = number of posts / day; cost/benefit analysis beats ignoring cost Cost-benefit optimization picks summarizer blogs! 123
• 124. Predicting the “hot” blogs Want blogs that will be informative in the future Split data set; train on historic, test on future Plot (cascades captured vs. Cost(A) = number of posts / day): greedy on future, tested on future (“cheating”) does well; greedy on historic, tested on future generalizes poorly Blog selection “overfits” the training data: it detects well on the training period, then poorly afterwards Want blogs that continue to do well! Why’s that? Let’s see what goes wrong here. 124
• 125. Robust optimization Let Fi(A) = detections in time interval i (e.g., F1(A) = .5, F2(A) = .8, F3(A) = .6, F4(A) = .01, F5(A) = .02) Greedy gives an “overfit” blog selection A whose detections concentrate early; optimizing the worst case over the intervals using SATURATE gives a “robust” blog selection A* whose detections are spread across all months Robust optimization ≈ regularization! 125
• 126. Predicting the “hot” blogs Plot (cascades captured vs. Cost(A) = number of posts / day): the robust solution tested on the future nearly matches greedy-on-future (“cheating”), while greedy on historic lags behind 50% better generalization! 126
• 127. Other aspects: Complex constraints maxA F(A) or maxA mini Fi(A) subject to: so far, |A| ≤ k In practice, more complex constraints: Different costs: C(A) ≤ B Locations need to be connected by paths [Chekuri & Pal, FOCS ’05] (lake monitoring) Sensors need to communicate (form a routing tree) [Singh et al, IJCAI ’07] (building monitoring) 127
• 128. Naïve approach: Greedy-connect Simple heuristic: greedily optimize submodular utility function F(A), then add nodes to minimize communication cost C(A) On the example floorplan this fails: the most informative location (F(A) = 4) allows no communication (C(A) = 10); a relay node is not very informative (F(A) = 0.2, C(A) = 1); the second most informative location (F(A) = 3.5, C(A) = 2) allows efficient communication Communication cost = expected # of transmission trials (learned using Gaussian processes) Want to find optimal tradeoff between information and communication cost 128
• 129. The pSPIEL Algorithm [Krause, Guestrin, Gupta, Kleinberg IPSN 2006] pSPIEL: efficient nonmyopic algorithm (padded Sensor Placements at Informative and cost-Effective Locations) Decompose sensing region into small, well-separated clusters Solve cardinality-constrained problem per cluster (greedy) Combine solutions using k-MST algorithm 129

Editor's notes

  1. Get p+1 approximation for C = intersection of p matroids; 1-1/e whp for any matroid using continuous greedy + pipage rounding
  2. Random placement: 53%; Uniform: 73% Optimized: 81%