Sparsity with sign-coherent groups of variables via the cooperative-Lasso

Julien Chiquet¹, Yves Grandvalet², Camille Charbonnier¹

    ¹ Statistique et Génome, CNRS & Université d'Évry Val d'Essonne
    ² Heudiasyc, CNRS & Université de Technologie de Compiègne

SSB – 29 March 2011

    arXiv preprint.
    http://arxiv.org/abs/1103.2697

    R-package scoop.
    http://stat.genopole.cnrs.fr/logiciels/scoop
Notations

Let
    Y be the output random variable,
    X = (X^1, …, X^p) be the input random variables, where X^j is the jth predictor.

The data
Given a sample {(y_i, x_i), i = 1, …, n} of i.i.d. realizations of (Y, X), denote
    y = (y_1, …, y_n) the response vector,
    x^j = (x^j_1, …, x^j_n) the vector of data for the jth predictor,
    X the n × p design matrix whose jth column is x^j,
    D = {i : (y_i, x_i) ∈ training set},
    T = {i : (y_i, x_i) ∈ test set}.
Generalized linear models

Suppose Y depends linearly on X through a function g:

    E(Y) = g(Xβ*).

We predict a response y_i by ŷ_i = g(x_i β̂) for any i ∈ T, by solving

    β̂ = arg max_β ℓ_D(β) = arg min_β Σ_{i∈D} L_g(y_i, x_i β),

where L_g is a loss function depending on the function g. Typically,
    if Y is Gaussian and g = Id (OLS),

        L_g(y, xβ) = (y − xβ)²,

    if Y is binary and g : t ↦ (1 + e^{−t})^{−1} (logistic regression),

        L_g(y, xβ) = −y · xβ + log(1 + e^{xβ}),

or any negative log-likelihood of an exponential family distribution.
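As a minimal R sketch (illustrative, not from the slides; eta stands for the linear predictor x_i β), the two losses read:

    ## Illustrative sketch of the two losses above; 'eta' is the linear predictor.
    loss_gaussian <- function(y, eta) (y - eta)^2                 # squared error (OLS)
    loss_logistic <- function(y, eta) -y * eta + log1p(exp(eta))  # negative Bernoulli log-likelihood

    ## Empirical risk over a training set D:
    ## sum(loss_logistic(y[D], X[D, ] %*% beta))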
Estimation and selection at the group level

  1. Structure: the set I = {1, …, p} splits into a known partition

         I = ∪_{k=1}^K G_k,  with G_k ∩ G_ℓ = ∅ for k ≠ ℓ.

  2. Sparsity: the support S of β* has few entries,

         S = {i : β*_i ≠ 0},  with |S| ≪ p.

The group-Lasso estimator
Grandvalet and Canu '98, Bakin '99, Yuan and Lin '06

    β̂^group = arg min_{β ∈ R^p} −ℓ_D(β) + λ Σ_{k=1}^K w_k ‖β_{G_k}‖,

where
    λ ≥ 0 controls the overall amount of penalty,
    w_k > 0 adapts the penalty between groups (dropped hereafter).
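A short R sketch of the group penalty (illustrative; it assumes the partition is encoded as a factor grp of length p):

    ## Illustrative sketch: the group-Lasso penalty with weights w.
    group_penalty <- function(beta, grp, w = rep(1, nlevels(grp))) {
      norms <- tapply(beta, grp, function(b) sqrt(sum(b^2)))  # ||beta_Gk|| for each group
      sum(w * norms)
    }

    ## Example with two groups {1,2} and {3,4}:
    ## group_penalty(c(1, -1, 2, 3), factor(c(1, 1, 2, 2)))  # = sqrt(2) + sqrt(13)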
Toy example: the prostate dataset

Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients:
    lcavol   log(cancer volume)
    lweight  log(prostate weight)
    age      age
    lbph     log(benign prostatic hyperplasia amount)
    svi      seminal vesicle invasion
    lcp      log(capsular penetration)
    gleason  Gleason score
    pgg45    percentage Gleason scores 4 or 5

[Figure: Lasso — coefficient paths against lambda (log scale).]
Toy example: the prostate dataset

Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients.

[Figure: hierarchical clustering — dendrogram of the 8 clinical measures, suggesting candidate groups.]
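As an illustrative R sketch (not the authors' code; it assumes X holds the 8 standardized clinical measures), candidate groups can be formed by clustering the predictors:

    ## Illustrative sketch: candidate groups from a clustering of the predictors.
    ## Assumes 'X' is the 97 x 8 matrix of standardized clinical measures.
    d   <- as.dist(1 - cor(X))        # dissimilarity from pairwise correlations
    hc  <- hclust(d)                  # dendrogram as in the figure
    grp <- factor(cutree(hc, k = 4))  # cut into k candidate groups (k = 4 is arbitrary)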
Toy example: the prostate dataset

Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients.

[Figure: group-Lasso — coefficient paths against lambda (log scale).]
Application to splice site detection

Predict splice site status (0/1) from a sequence of 7 bases and their interactions:
    order 0: 7 factors with 4 levels,
    order 1: C(7,2) = 21 factors with 4² levels,
    order 2: C(7,3) = 35 factors with 4³ levels;
using dummy coding for each factor, we form groups.

[Figure: information content at each position of the sequence.]

    L. Meier, S. van de Geer, P. Bühlmann, 2008.
    The group-Lasso for logistic regression, JRSS Series B.
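A hedged R sketch of the group construction by dummy coding (illustrative names; it assumes bases is a data.frame with 7 factor columns, one per position, levels A/C/G/T):

    ## Illustrative sketch: dummy-code the base factors and their pairwise
    ## interactions; each term (factor or interaction) defines one group.
    mm  <- model.matrix(~ .^2, data = bases)  # main effects + order-1 interactions
    Xd  <- mm[, -1]                           # drop the intercept column
    grp <- factor(attr(mm, "assign")[-1])     # column-to-term map defines the groups
    ## (use ~ .^3 to also include the order-2 interactions)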
Application to splice site detection

Predict splice site status (0/1) from a sequence of 7 bases and their interactions.

[Figure: selected groups and their coefficients, colored by interaction order (0, 1, 2).]

    L. Meier, S. van de Geer, P. Bühlmann, 2008.
    The group-Lasso for logistic regression, JRSS Series B.
Group-Lasso limitations

  1. Not a single zero should belong to a group with non-zeros.
         Strong group sparsity (Huang and Zhang, '10, arXiv) establishes the
         conditions under which the group-Lasso outperforms the Lasso, and conversely.
  2. No sign-coherence within groups.
         Sign-coherence is required when groups gather consonant variables,
         e.g., groups defined by clusters of positively correlated variables.

The cooperative-Lasso
A penalty which assumes a sign-coherent group structure, that is, groups which gather either
    non-positive,
    non-negative,
    or null parameters.
Motivation: multiple network inference

[Figure: three experiments, each leading to an inferred network over the same nodes.]

A group is a set of corresponding edges across tasks (e.g., the red or blue ones): sign-coherence matters!

    J. Chiquet, Y. Grandvalet, C. Ambroise, 2010.
    Inferring multiple graphical structures, Statistics and Computing.
Motivation: joint segmentation of aCGH profiles

For a single profile,

    minimize_{β ∈ R^p} ‖β − y‖²,  s.t. Σ_{i=1}^p |β_i − β_{i−1}| < s,

where
    y is a vector in R^p (one profile),
    β is a vector in R^p.

[Figure: log-ratio (CNVs) against position on the chromosome, with a piecewise-constant fit.]

    K. Bleakley and J.-P. Vert, 2010.
    Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
Motivation: joint segmentation of aCGH profiles

For n profiles jointly,

    minimize_{β ∈ R^{n×p}} ‖β − Y‖²,  s.t. Σ_{i=1}^p ‖β_i − β_{i−1}‖ < s,

where
    Y is an n × p matrix gathering the n profiles of size p,
    β_i is a size-n vector with the ith probes of the n profiles,
    a group gathers every position i across profiles.

Sign-coherence may avoid inconsistent variations across profiles.

[Figure: log-ratio (CNVs) against position on the chromosome for the n profiles.]

    K. Bleakley and J.-P. Vert, 2010.
    Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
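A small R sketch of the joint penalty above (illustrative; B is an n × p matrix of profiles):

    ## Illustrative sketch: the groupwise total-variation penalty
    ## sum_i || beta_i - beta_{i-1} || for an n x p matrix of profiles.
    joint_tv <- function(B) {
      jumps <- B[, -1, drop = FALSE] - B[, -ncol(B), drop = FALSE]  # beta_i - beta_{i-1}
      sum(sqrt(colSums(jumps^2)))                                   # l2 norm at each position
    }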
Outline


  Definition

  Resolution

  Consistency

  Model selection

  Simulation studies

  Sibling probe sets and gene selection




The cooperative-Lasso estimator

Definition

    β̂^coop = arg min_{β ∈ R^p} J(β),  with J(β) = −ℓ_D(β) + λ‖β‖_coop,

where, for any v ∈ R^p,

    ‖v‖_coop = ‖v⁺‖_group + ‖v⁻‖_group = Σ_{k=1}^K ( ‖v⁺_{G_k}‖ + ‖v⁻_{G_k}‖ ),

and
    v⁺ = (v⁺_1, …, v⁺_p), with v⁺_j = max(0, v_j),
    v⁻ = (v⁻_1, …, v⁻_p), with v⁻_j = max(0, −v_j).
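A minimal R sketch of the coop-norm (illustrative; it assumes the partition is encoded as a factor grp of length p):

    ## Illustrative sketch: the coop-norm of a vector v.
    coop_norm <- function(v, grp) {
      vplus  <- pmax(v, 0)   # v+
      vminus <- pmax(-v, 0)  # v-
      gnorm  <- function(u) sum(tapply(u, grp, function(b) sqrt(sum(b^2))))
      gnorm(vplus) + gnorm(vminus)
    }

    ## Example with two groups {1,2} and {3,4}:
    ## coop_norm(c(1, -1, 2, 3), factor(c(1, 1, 2, 2)))  # = 1 + 1 + sqrt(13)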
A geometric view of sparsity

    minimize_{β_1, β_2} −ℓ(β_1, β_2) + λ Ω(β_1, β_2)

is equivalent to

    maximize_{β_1, β_2} ℓ(β_1, β_2)  s.t. Ω(β_1, β_2) ≤ c.

[Figure: level sets of ℓ and the admissible set {Ω ≤ c} in the (β_1, β_2) plane.]
Ball crafting: group-Lasso

Admissible set
    β = (β_1, β_2, β_3, β_4), with G_1 = {1, 2}, G_2 = {3, 4}.

Unit ball
    ‖β‖_group ≤ 1.

[Figure: cross-sections of the unit ball in the (β_1, β_3) plane, for β_2 ∈ {0, 0.3} and β_4 ∈ {0, 0.3}.]
Ball crafting: cooperative-Lasso

Admissible set
    β = (β_1, β_2, β_3, β_4), with G_1 = {1, 2}, G_2 = {3, 4}.

Unit ball
    ‖β‖_coop ≤ 1.

[Figure: cross-sections of the unit ball in the (β_1, β_3) plane, for β_2 ∈ {0, 0.3} and β_4 ∈ {0, 0.3}.]
Convex analysis
Supporting hyperplanes

A hyperplane supports a set iff
    the set is contained in one of the half-spaces it defines,
    the set has at least one point on the hyperplane.

[Figure: three sets in the (β_1, β_2) plane, each with a supporting hyperplane.]

There are supporting hyperplanes at all points of a convex set: they generalize tangents.
Convex analysis
Dual cone and subgradient

Subgradients generalize normals.

[Figure: the same three sets, with normals / subgradients at selected points.]

g is a subgradient at x when the vector (g, −1) is normal to the supporting hyperplane at this point.

The subdifferential at x is the set of all subgradients at x.
Optimality conditions

Theorem
A necessary and sufficient condition for the optimality of β is that the null vector 0 belongs to the subdifferential of the convex function J:

    0 ∈ ∂_β J(β) = {v ∈ R^p : v = −∇_β ℓ_D(β) + λθ},

where θ ∈ R^p belongs to the subdifferential of the coop-norm. Define

    ϕ_j(v) = ‖(sign(v_j) v)⁺‖;

then θ is such that

    ∀k ∈ {1, …, K}, ∀j ∈ S_k(β),    θ_j = β_j / ϕ_j(β_{G_k}),
    ∀k ∈ {1, …, K}, ∀j ∈ S_k^c(β),  ϕ_j(θ_{G_k}) ≤ 1.

We derive a subset algorithm to solve this problem (which you can enjoy in the paper and the package).
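A minimal R sketch of the sign-restricted norm ϕ_j used above (illustrative; v is the restriction of a vector to one group):

    ## Illustrative sketch: phi_j(v) = || (sign(v_j) * v)+ ||.
    phi <- function(v, j) {
      s <- sign(v[j])
      sqrt(sum(pmax(s * v, 0)^2))  # l2 norm of the positive part of sign(v_j) * v
    }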
Linear regression with orthonormal design

Consider

    β̂ = arg min_β (1/2)‖y − Xβ‖² + λ Ω(β),

with XᵀX = I. Hence (x^j)ᵀ(Xβ − y) = β_j − β̂^ols_j, and

    β̂ = arg min_β (1/2)‖β − β̂^ols‖² + λ Ω(β).

We may find a closed form of β̂ for, e.g.,
  1. Ω(β) = ‖β‖_lasso,
  2. Ω(β) = ‖β‖_group,
  3. Ω(β) = ‖β‖_coop.
Closed form for the Lasso: ∀j ∈ {1, …, p},

    β̂^lasso_j = (1 − λ/|β̂^ols_j|)⁺ β̂^ols_j,   i.e.   |β̂^lasso_j| = (|β̂^ols_j| − λ)⁺.

[Figure: Lasso as a function of the OLS coefficients (soft-thresholding).]
Closed form for the group-Lasso: ∀k ∈ {1, …, K}, ∀j ∈ G_k,

    β̂^group_j = (1 − λ/‖β̂^ols_{G_k}‖)⁺ β̂^ols_j,   i.e.   ‖β̂^group_{G_k}‖ = (‖β̂^ols_{G_k}‖ − λ)⁺.

[Figure: group-Lasso as a function of the OLS coefficients (groupwise soft-thresholding).]
Closed form for the coop-Lasso: ∀k ∈ {1, …, K}, ∀j ∈ G_k,

    β̂^coop_j = (1 − λ/ϕ_j(β̂^ols_{G_k}))⁺ β̂^ols_j,   i.e.   ϕ_j(β̂^coop_{G_k}) = (ϕ_j(β̂^ols_{G_k}) − λ)⁺.

[Figure: coop-Lasso as a function of the OLS coefficients (signwise soft-thresholding).]
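The three closed forms translate into one-line thresholding rules; a hedged R sketch follows (illustrative, not the scoop implementation; b is the OLS coefficient vector, grp a factor giving the partition):

    ## Illustrative sketches of the three thresholding operators above.
    soft_lasso <- function(b, lambda) pmax(1 - lambda / abs(b), 0) * b

    soft_group <- function(b, grp, lambda) {
      shrink <- tapply(b, grp, function(u) max(1 - lambda / sqrt(sum(u^2)), 0))
      as.vector(shrink[grp]) * b  # groupwise shrinkage factor applied to each coefficient
    }

    soft_coop <- function(b, grp, lambda) {
      unsplit(lapply(split(b, grp), function(u) {
        phi <- function(j) sqrt(sum(pmax(sign(u[j]) * u, 0)^2))  # phi_j(u)
        sapply(seq_along(u), function(j) max(1 - lambda / phi(j), 0)) * u
      }), grp)
    }

    ## Example:
    ## b <- c(0.5, -0.2, 1.0, 0.8); grp <- factor(c(1, 1, 2, 2))
    ## soft_lasso(b, 0.3); soft_group(b, grp, 0.3); soft_coop(b, grp, 0.3)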
Linear regression setup
Technical assumptions

(A1) X and Y have finite fourth-order moments:

        E‖X‖⁴ < ∞,   E|Y|⁴ < ∞;

(A2) the covariance matrix Ψ = E[XXᵀ] ∈ R^{p×p} is invertible;

(A3) for every k = 1, …, K: if ‖(β*_{G_k})⁺‖ > 0 and ‖(β*_{G_k})⁻‖ > 0, then β*_j ≠ 0 for every j ∈ G_k
     (all sign-coherent groups are either included in or excluded from the true support).
Irrepresentability condition
  Define Sk = S ∩ Gk the support within group k and, for j ∈ Gk,

  \[
  [D(\beta)]_{jj} = \big\|(\operatorname{sign}(\beta_j)\,\beta_{G_k})_+\big\|^{-1}.
  \]

  Assume there exists η > 0 such that
(A4) for every group Gk including at least one null coefficient:

  \[
  \max\Big( \big\|\big(\Psi_{S_k^c S}\,\Psi_{SS}^{-1}\,D(\beta_S)\,\beta_S\big)_+\big\|,\;
            \big\|\big(\Psi_{S_k^c S}\,\Psi_{SS}^{-1}\,D(\beta_S)\,\beta_S\big)_-\big\| \Big) \le 1 - \eta,
  \]

(A5) for every group Gk intersecting the support and including either
     positive or negative coefficients, let νk be the sign of these
     coefficients (νk = 1 if (β Gk )+ > 0 and νk = −1 if (β Gk )− > 0):

  \[
  \nu_k\,\Psi_{S_k^c S}\,\Psi_{SS}^{-1}\,D(\beta_S)\,\beta_S \succeq 0,
  \]

         where ⪰ denotes componentwise inequality.

cooperative-Lasso                                                               23
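As a rough numerical companion (my own helper, not part of the scoop package, and only as reliable as the reconstructed display above), the left-hand side of (A4) can be evaluated for a given Ψ, true β and grouping (groups given as lists of 0-based indices):

import numpy as np

def coop_irrepresentability_lhs(Psi, beta, groups):
    """For each group with at least one null coefficient, the larger of the norms
    of the positive and negative parts of Psi[Sk^c, S] inv(Psi[S, S]) D(beta_S) beta_S."""
    S = np.flatnonzero(beta)
    Psi_SS_inv = np.linalg.inv(Psi[np.ix_(S, S)])
    d = np.empty(len(S))                 # entries of D(beta_S) beta_S
    for i, j in enumerate(S):
        gk = next(g for g in groups if j in g)
        same_sign = np.maximum(np.sign(beta[j]) * beta[gk], 0.0)
        d[i] = beta[j] / np.linalg.norm(same_sign)
    lhs = {}
    for k, g in enumerate(groups):
        Skc = np.setdiff1d(g, S)         # null coefficients of group k
        if Skc.size == 0:
            continue
        v = Psi[np.ix_(Skc, S)] @ Psi_SS_inv @ d
        lhs[k] = max(np.linalg.norm(np.maximum(v, 0.0)),
                     np.linalg.norm(np.minimum(v, 0.0)))
    return lhs                           # (A4) requires each value to stay below 1 - eta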
Consistency results



  Theorem
  If assumptions (A1–5) are satisfied, then for every sequence λn such that
  λn = λ0 n^{−γ}, γ ∈ ]0, 1/2[,

  \[
  \hat\beta^{\mathrm{coop}} \xrightarrow{\;P\;} \beta
  \qquad\text{and}\qquad
  P\big(S(\hat\beta^{\mathrm{coop}}) = S\big) \to 1.
  \]




  Asymptotically, the cooperative-Lasso is unbiased and enjoys exact
  support recovery (even when there are irrelevant variables within a
  group).




cooperative-Lasso                                                          24
Sketch of the proof

    1. Construct an artificial estimator β̃S restricted to the true support S
       and extend it with 0 coefficients on S^c.
    2. Consider the event En on which β̃ satisfies the original optimality
       conditions. On En, β̃ coincides with β̂coop on S and β̂coop vanishes
       on S^c, by uniqueness.
    3. We need to prove that lim_{n→∞} P(En) = 1.
    4. Derive the asymptotic distribution of the derivative of the loss
       function Xᵀ(y − Xβ̃) from
                     the CLT applied to second-order moments,
                     the optimality conditions on β̃S;
       the right choice of λn then provides convergence in probability.
    5. Assumptions (A4–5) ensure that the limits in probability satisfy the
       optimality constraints with strict inequalities.
    6. As a result, the optimality conditions are satisfied (with non-strict
       inequalities) with probability tending to 1.

cooperative-Lasso                                                             25
Illustration
  Generate data y = Xβ + σε, with
         β = (1, 1, −1, −1, 0, 0, 0, 0),
         G = {{1, 2}, {3, 4}, {5, 6}, {7, 8}},
         σ = 0.1, R² ≈ 0.99, n = 20;
  the irrepresentability condition
         holds for the coop-Lasso,
         does not hold for the group-Lasso;
  average over 100 simulations.

       Fig.: coefficient paths as functions of log10(λ) — 50% coverage intervals
       (upper/lower quartiles) over the 100 simulations, for the group-Lasso and
       the coop-Lasso.
cooperative-Lasso                                                                            26
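A minimal sketch of this simulation (my own reimplementation; the slide does not specify the law of X, so a standard Gaussian design is assumed):

import numpy as np

rng = np.random.default_rng(42)
n, sigma = 20, 0.1
beta = np.array([1.0, 1.0, -1.0, -1.0, 0.0, 0.0, 0.0, 0.0])
groups = [[0, 1], [2, 3], [4, 5], [6, 7]]      # 0-based version of G
X = rng.standard_normal((n, beta.size))        # assumed design, unspecified above
y = X @ beta + sigma * rng.standard_normal(n)  # y = X beta + sigma * eps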
Outline


  Definition

  Resolution

  Consistency

  Model selection

  Simulation studies

  Sibling probe sets and gene selection




cooperative-Lasso                         27
Optimism of the training error
         The training error:
  \[ \mathrm{err} = \frac{1}{|D|} \sum_{i\in D} L(y_i, x_i\hat\beta). \]

         The test error (“extra-sample” error):
  \[ \mathrm{Err}_{\mathrm{ex}} = E_{X,Y}\big[L(Y, X\hat\beta)\,\big|\,D\big]. \]

         The “in-sample” error:
  \[ \mathrm{Err}_{\mathrm{in}} = \frac{1}{|D|}\sum_{i\in D} E_{Y}\big[L(Y_i, x_i\hat\beta)\,\big|\,D\big]. \]

  Definition (Optimism)
  \[ \mathrm{Err}_{\mathrm{in}} = \mathrm{err} + \text{“optimism”}. \]

cooperative-Lasso                                                     28
Cp statistics
  For squared-error loss (and some other losses),
  \[ \mathrm{Err}_{\mathrm{in}} = \mathrm{err} + \frac{2}{|D|}\sum_{i\in D} \mathrm{cov}(\hat y_i, y_i). \]

         The amount by which err underestimates the true error depends
         on how strongly yi affects its own prediction. The harder we fit
         the data, the greater the covariance will be, thereby increasing
         the optimism (ESL, 2nd ed., 5th printing).

  Mallows’ Cp statistic
  For a linear regression fit ŷ with p inputs, Σ_{i∈D} cov(ŷi, yi) = p σ², hence
  \[ C_p = \mathrm{err} + 2\,\frac{\mathrm{df}}{|D|}\,\hat\sigma^2, \qquad \text{with } \mathrm{df} = p. \]

cooperative-Lasso                                                                    29
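As a minimal sketch (the name is mine), the statistic is a one-liner given a df value and a noise-variance estimate:

import numpy as np

def mallows_cp(y, y_hat, df, sigma2):
    """Cp = err + 2 * (df / |D|) * sigma^2, with err the training MSE."""
    err = np.mean((y - y_hat) ** 2)
    return err + 2.0 * df / len(y) * sigma2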
Generalized degrees of freedom
  Let ŷ(λ) = Xβ̂(λ) be the predicted values for a penalized estimator.

  Proposition (Efron (’04) + Stein’s lemma (’81))
  \[ \mathrm{df}(\lambda) \;\dot{=}\; \frac{1}{\sigma^2}\sum_{i\in D}\mathrm{cov}(\hat y_i(\lambda), y_i) \;=\; E_{y}\,\mathrm{tr}\Big(\frac{\partial \hat y_\lambda}{\partial y}\Big). \]

  For the Lasso, Zou et al. (’07) show that
  \[ \widehat{\mathrm{df}}{}^{\,\mathrm{lasso}}(\lambda) = \big\|\hat\beta^{\mathrm{lasso}}(\lambda)\big\|_0. \]

  Assuming XᵀX = I, Yuan and Lin (’06) show for the group-Lasso that the
  trace term equals
  \[ \widehat{\mathrm{df}}{}^{\,\mathrm{group}}(\lambda) = \sum_{k=1}^{K} \mathbf{1}\big\{\|\hat\beta^{\mathrm{group}}_{G_k}(\lambda)\| > 0\big\}\Big(1 + \frac{\|\hat\beta^{\mathrm{group}}_{G_k}(\lambda)\|}{\|\hat\beta^{\mathrm{ols}}_{G_k}\|}\,(p_k - 1)\Big). \]

cooperative-Lasso                                                         30
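Both expressions translate directly into code; a hedged sketch (function names are mine), with the group-Lasso formula valid under the orthonormal-design assumption above:

import numpy as np

def df_lasso(b_lasso):
    """Zou et al. ('07): number of non-zero coefficients."""
    return np.count_nonzero(b_lasso)

def df_group(b_group, b_ols, groups):
    """Yuan and Lin ('06), assuming X'X = I; groups are lists of 0-based indices."""
    df = 0.0
    for g in groups:
        norm = np.linalg.norm(b_group[g])
        if norm > 0:
            df += 1.0 + norm / np.linalg.norm(b_ols[g]) * (len(g) - 1)
    return df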
Approximated degrees of freedom for the coop-Lasso
  Proposition
  Assuming that data are generated according to a linear regression model
  and that X is orthonormal, the following expression of df coop (λ) is an
  unbiased estimate of df(λ)

  \[
  \widehat{\mathrm{df}}{}^{\,\mathrm{coop}}(\lambda) = \sum_{k=1}^{K}
  \mathbf{1}\big\{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_+\| > 0\big\}
  \Big(1 + (p_k^+ - 1)\,\frac{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_+\|}{\|(\hat\beta^{\mathrm{ols}}_{G_k})_+\|}\Big)
  + \mathbf{1}\big\{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_-\| > 0\big\}
  \Big(1 + (p_k^- - 1)\,\frac{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_-\|}{\|(\hat\beta^{\mathrm{ols}}_{G_k})_-\|}\Big),
  \]

  where p_k^+ and p_k^- are respectively the number of positive and negative
  entries in β̂_Gk^ols.
cooperative-Lasso                                                                                  31
Approximated degrees of freedom for the coop-Lasso
  In practice, the OLS reference is replaced by a ridge estimate β̂ridge(γ),
  leading to

  \[
  \widehat{\mathrm{df}}{}^{\,\mathrm{coop}}(\lambda) = \sum_{k=1}^{K}
  \mathbf{1}\big\{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_+\| > 0\big\}
  \Big(1 + \frac{p_k^+ - 1}{1+\gamma}\,\frac{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_+\|}{\|(\hat\beta^{\mathrm{ridge}}_{G_k}(\gamma))_+\|}\Big)
  + \mathbf{1}\big\{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_-\| > 0\big\}
  \Big(1 + \frac{p_k^- - 1}{1+\gamma}\,\frac{\|(\hat\beta^{\mathrm{coop}}_{G_k}(\lambda))_-\|}{\|(\hat\beta^{\mathrm{ridge}}_{G_k}(\gamma))_-\|}\Big),
  \]

  where p_k^+ and p_k^- are respectively the number of positive and negative
  entries in β̂_Gk^ridge(γ).
cooperative-Lasso                                                                         31
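A sketch of the first (OLS-referenced) expression (my code, not the scoop implementation); positive and negative parts are handled symmetrically, and a part that is inactive in the coop estimate contributes nothing, which also avoids dividing by a zero reference norm:

import numpy as np

def df_coop(b_coop, b_ols, groups):
    """Unbiased df estimate under orthonormal design (proposition above)."""
    df = 0.0
    for g in groups:
        for part in (np.maximum, np.minimum):      # positive, then negative part
            coop = part(b_coop[g], 0.0)
            ols = part(b_ols[g], 0.0)
            n_coop = np.linalg.norm(coop)
            if n_coop > 0:                          # this part is active
                pk = np.count_nonzero(ols)          # p_k^+ (resp. p_k^-)
                df += 1.0 + (pk - 1) * n_coop / np.linalg.norm(ols)
    return df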
Approximated information criteria

  Following Zou et al., we extend the Cp statistic to an “approximated” AIC,
  \[ \mathrm{AIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + 2\,\widetilde{\mathrm{df}}(\lambda), \]
  and from the AIC there is a (small) step to the BIC:
  \[ \mathrm{BIC}(\lambda) = \frac{\|y - \hat y(\lambda)\|^2}{\sigma^2} + \log(n)\,\widetilde{\mathrm{df}}(\lambda). \]

         K-fold cross-validation works well but is computationally intensive.
         It is required when we do not meet the linear regression setup. . .



cooperative-Lasso                                                              32
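In code (a sketch; σ² and the df estimate come from the previous slides):

import numpy as np

def approx_aic_bic(y, y_hat, df, sigma2):
    """Approximated AIC and BIC for a penalized fit y_hat with df degrees of freedom."""
    rss = np.sum((y - y_hat) ** 2)
    return rss / sigma2 + 2.0 * df, rss / sigma2 + np.log(len(y)) * df

Evaluated along the λ path, the minimizer of either criterion selects the model.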
Outline


  Definition

  Resolution

  Consistency

  Model selection

  Simulation studies

  Sibling probe sets and gene selection




cooperative-Lasso                         33
Revisiting Elastic-Net experiments (1)



  Generate data y = Xβ + σε, with
         β = (0, . . . , 0, 2, . . . , 2, 0, . . . , 0, 2, . . . , 2), in four blocks of 10,
         G1 = {1, . . . , 10}, G2 = {11, . . . , 20}, G3 = {21, . . . , 30}, G4 = {31, . . . , 40},
         σ = 15, corr(xi , xj ) = 0.5,
         training/validation/test = 100/100/400,
         average over 100 simulations.

       Fig.: boxplots of the test MSE for the lasso, elastic-net, group-Lasso and coop-Lasso.
cooperative-Lasso                                                                                   34
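A sketch of one replicate, reading corr(xi, xj) = 0.5 as an equicorrelated Gaussian design (an assumption; the slide does not spell out the joint law):

import numpy as np

rng = np.random.default_rng(0)
p, n, sigma = 40, 100, 15.0
beta = np.r_[np.zeros(10), 2 * np.ones(10), np.zeros(10), 2 * np.ones(10)]
Sigma = np.full((p, p), 0.5)              # corr(x_i, x_j) = 0.5 for i != j
np.fill_diagonal(Sigma, 1.0)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)   # training design
y = X @ beta + sigma * rng.standard_normal(n)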
Revisiting Elastic-Net experiments (2)


  Generate data y = Xβ + σε, with
         β = (3, . . . , 3, 0, . . . , 0), 15 threes followed by 25 zeros,
         σ = 15,
         G1 = {1, . . . , 5}, G2 = {6, . . . , 10}, G3 = {11, . . . , 15}, G4 = {16, . . . , 40},
         xj = Z1 + ε, Z1 ∼ N (0, 1), ∀j ∈ G1,
         xj = Z2 + ε, Z2 ∼ N (0, 1), ∀j ∈ G2,
         xj = Z3 + ε, Z3 ∼ N (0, 1), ∀j ∈ G3,
         xj ∼ N (0, 1), ∀j ∈ G4,
         training/validation/test = 50/50/400,
         average over 100 simulations.

       Fig.: boxplots of the test MSE for the lasso, elastic-net, group-Lasso and coop-Lasso.
cooperative-Lasso                                                                  35
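A sketch of one replicate; the within-group noise level on the xj (0.1 here) is an assumption, the slide leaving ε unspecified:

import numpy as np

rng = np.random.default_rng(1)
n, sigma = 50, 15.0
beta = np.r_[3 * np.ones(15), np.zeros(25)]
Z = rng.standard_normal((n, 3))                    # one latent factor per group
X = np.empty((n, 40))
for k in range(3):                                 # G1, G2, G3: factor + small noise
    X[:, 5 * k:5 * (k + 1)] = Z[:, [k]] + 0.1 * rng.standard_normal((n, 5))
X[:, 15:] = rng.standard_normal((n, 25))           # G4: independent noise variables
y = X @ beta + sigma * rng.standard_normal(n)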
Breiman’s setup
 Simulations setting

  A wave-like vector of parameters β
         p = 90 variables partitioned into K = 10 groups of size pk = 9,
         3 (partially) active groups, 6 groups of zeros,
         in active groups, βj ∝ (h − |5 − j|)+ for j = 1, . . . , 9, with h = 1, . . . , 5
         (see the sketch after this slide).

      Figure: β for h = 1, . . . , 5; each active group then has |Sk| = 2h − 1
      (i.e. 1, 3, 5, 7, 9) non-zero coefficients.
cooperative-Lasso                                                                 36
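The construction sketch referenced above (my reading of the figures: the positive part and the placement of the active groups first are assumptions; the proportionality constant is set later through the R² calibration):

import numpy as np

def wave_beta(h, K=10, pk=9, n_active=3):
    """Wave-like coefficients: (h - |5 - j|)_+ within each active group, j = 1..pk."""
    wave = np.maximum(h - np.abs(5 - np.arange(1, pk + 1)), 0).astype(float)
    beta = np.zeros(K * pk)
    for k in range(n_active):
        beta[k * pk:(k + 1) * pk] = wave
    return beta

For instance, wave_beta(3) has 5 non-zero entries in each of the first three groups, matching |Sk| = 2h − 1.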
Breiman’s setup
 Simulations setting

  A wave-like vector of parameters β
         p = 90 variables partitioned into K = 10 groups of size pk = 9,
         3 (partially) active groups, 6 groups of zeros,
         in active groups, β j ∝ (h − |5 − j|) with h = 1, . . . , 5.




                             0     20     40    60     80


      Figure: β with h = 2, |Sk | = 3 non-zero coefficients in each active group.
cooperative-Lasso                                                                 36
Breiman’s setup
 Simulations setting

  A wave-like vector of parameters β
         p = 90 variables partitioned into K = 10 groups of size pk = 9,
         3 (partially) active groups, 6 groups of zeros,
         in active groups, β j ∝ (h − |5 − j|) with h = 1, . . . , 5.




                             0     20     40    60     80


      Figure: β with h = 3, |Sk | = 5 non-zero coefficients in each active group.
cooperative-Lasso                                                                 36
Breiman’s setup
 Simulations setting

  A wave-like vector of parameters β
         p = 90 variables partitioned into K = 10 groups of size pk = 9,
         3 (partially) active groups, 6 groups of zeros,
         in active groups, β j ∝ (h − |5 − j|) with h = 1, . . . , 5.




                             0     20     40    60     80


      Figure: β with h = 4, |Sk | = 7 non-zero coefficients in each active group.
cooperative-Lasso                                                                 36
Breiman’s setup
 Simulations setting

  A wave-like vector of parameters β
         p = 90 variables partitioned into K = 10 groups of size pk = 9,
         3 (partially) active groups, 6 groups of zeros,
         in active groups, β j ∝ (h − |5 − j|) with h = 1, . . . , 5.




                             0     20     40    60     80


      Figure: β with h = 5, |Sk | = 9 non-zero coefficients in each active group.
cooperative-Lasso                                                                 36
Breiman’s setup
 Example of solution paths and signal recovery with BIC choice

  The signal is generated as
         y = Xβ + σε, with σ = 1, n = 30 to 500,
         X ∼ N (0, Ψ) with Ψij = ρ^|i−j| (ρ = 0.4 in the example),
         the magnitude of β chosen so that R² ≈ 0.75 (see the sketch below).

  Remark
         The covariance structure is purposely disconnected from the group structure.
         None of the support recovery conditions are fulfilled.




cooperative-Lasso                                                                   37
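The R² calibration has a closed form: with X ∼ N(0, Ψ), Var(Xβ) = βᵀΨβ, so scaling β by c with c²βᵀΨβ / (c²βᵀΨβ + σ²) = R² hits the target. A sketch (my derivation, not the paper's code):

import numpy as np

def breiman_sample(beta, n, rho=0.4, sigma=1.0, r2=0.75, rng=None):
    """Draw (X, y) with an AR(1) design and beta rescaled so that R^2 ~ r2."""
    rng = rng if rng is not None else np.random.default_rng()
    p = beta.size
    idx = np.arange(p)
    Psi = rho ** np.abs(np.subtract.outer(idx, idx))   # Psi_ij = rho^|i-j|
    c = np.sqrt(r2 * sigma**2 / ((1 - r2) * (beta @ Psi @ beta)))
    X = rng.multivariate_normal(np.zeros(p), Psi, size=n)
    y = X @ (c * beta) + sigma * rng.standard_normal(n)
    return X, y, c * beta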
Breiman’s setup
 Example of path of solution and signal recovery with BIC choice

  The signal strength is generated so as
         y = Xβ + σ , with σ = 1, n = 30 to 500,
         X ∼ N (0, Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),
         magnitude in β chosen so as R2 ≈ 0.75.




                         One shot sample with n = 120




cooperative-Lasso                                                   37
Breiman’s setup
 Example of path of solution and signal recovery with BIC choice

  The signal strength is generated so as
                   y = Xβ + σ , with σ = 1, n = 30 to 500,
                   X ∼ N (0, Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),
                   magnitude in β chosen so as R2 ≈ 0.75.
          0.6




                                                                             0.5
          0.4




                                                                             0.4
                                                                             0.3
 ˆlasso




                                                                    ˆlasso
          0.2




                                                                                                                True signal
                                                                             0.2
 β




                                                                    β



                                                                                                                Estimated signal
                                                                             0.1
          0.0




                                                                             0.0
          -0.2




                                                                             -0.1




                 -0.4   -0.2   0.0    0.2   0.4   0.6   0.8   1.0                   0   20   40       60   80
                                     log10 (λ)                                                    i
                                                                               Figure: Lasso
cooperative-Lasso                                                                                                             37
Breiman’s setup
 Example of path of solution and signal recovery with BIC choice

  The signal strength is generated so as
                   y = Xβ + σ , with σ = 1, n = 30 to 500,
                   X ∼ N (0, Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),
                   magnitude in β chosen so as R2 ≈ 0.75.
          0.5




                                                                   0.5
          0.4




                                                                   0.4
          0.3




                                                                   0.3
 ˆgroup




                                                          ˆgroup
          0.2




                                                                                                      True signal
                                                                   0.2
 β




                                                          β
          0.1




                                                                                                      Estimated signal
                                                                   0.1
          0.0




                                                                   0.0
          -0.1




                                                                   -0.1




                 -0.4   -0.2   0.0   0.2    0.4   0.6   0.8               0   20   40       60   80
                                log10 (λ)                                               i
                                                              Figure: Group-Lasso
cooperative-Lasso                                                                                                   37
Breiman’s setup
 Example of path of solution and signal recovery with BIC choice

  The signal strength is generated so as
                  y = Xβ + σ , with σ = 1, n = 30 to 500,
                  X ∼ N (0, Ψ) with Ψij = ρ|i−j| (ρ = 0.4 in the example),
                  magnitude in β chosen so as R2 ≈ 0.75.
         0.5




                                                                 0.5
         0.4




                                                                 0.4
         0.3




                                                                 0.3
 ˆcoop




                                                         ˆcoop


                                                                                                    True signal
         0.2




                                                                 0.2
 β




                                                         β



                                                                                                    Estimated signal
         0.1




                                                                 0.1
         0.0




                                                                 0.0
         -0.1




                                                                 -0.1




                -0.4   -0.2   0.0   0.2    0.4   0.6   0.8              0   20   40       60   80
                               log10 (λ)                                              i
                                                             Figure: Coop-Lasso
cooperative-Lasso                                                                                                 37
Breiman’s setup
 Errors as a function of the sample size n




       Figure: prediction error (left) and sign error (right) as functions of n,
       for h = 3, |Sk| = 5 (favoring the Lasso); curves for the lasso, group-Lasso
       and coop-Lasso.
cooperative-Lasso                                                                                                        38
Breiman’s setup
 Errors as a function of the sample size n




       Figure: prediction error (left) and sign error (right) as functions of n,
       for h = 4, |Sk| = 7 (intermediate); curves for the lasso, group-Lasso and
       coop-Lasso.
cooperative-Lasso                                                                                                         38

Contenu connexe

Tendances

Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projectionNBER
 
Jump-growth model for predator-prey dynamics
Jump-growth model for predator-prey dynamicsJump-growth model for predator-prey dynamics
Jump-growth model for predator-prey dynamicsgustavdelius
 
Lecture on solving1
Lecture on solving1Lecture on solving1
Lecture on solving1NBER
 
Lesson 16 The Spectral Theorem and Applications
Lesson 16  The Spectral Theorem and ApplicationsLesson 16  The Spectral Theorem and Applications
Lesson 16 The Spectral Theorem and ApplicationsMatthew Leingang
 
Bayesian regression models and treed Gaussian process models
Bayesian regression models and treed Gaussian process modelsBayesian regression models and treed Gaussian process models
Bayesian regression models and treed Gaussian process modelsTommaso Rigon
 
Chapter 4 likelihood
Chapter 4 likelihoodChapter 4 likelihood
Chapter 4 likelihoodNBER
 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
OptimalpolicyhandoutNBER
 
CMA-ES with local meta-models
CMA-ES with local meta-modelsCMA-ES with local meta-models
CMA-ES with local meta-modelszyedb
 
Alternating direction
Alternating directionAlternating direction
Alternating directionDerek Pang
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trend
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future TrendCVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trend
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trendzukun
 
Model For Estimating Diversity Presentation
Model For Estimating Diversity PresentationModel For Estimating Diversity Presentation
Model For Estimating Diversity PresentationDavid Torres
 
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...SEENET-MTP
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Beniamino Murgante
 
Logit stick-breaking priors for partially exchangeable count data
Logit stick-breaking priors for partially exchangeable count dataLogit stick-breaking priors for partially exchangeable count data
Logit stick-breaking priors for partially exchangeable count dataTommaso Rigon
 
Asymptotics for discrete random measures
Asymptotics for discrete random measuresAsymptotics for discrete random measures
Asymptotics for discrete random measuresJulyan Arbel
 
B. Sazdovic - Noncommutativity and T-duality
B. Sazdovic - Noncommutativity and T-dualityB. Sazdovic - Noncommutativity and T-duality
B. Sazdovic - Noncommutativity and T-dualitySEENET-MTP
 
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...zukun
 
PAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ WarwickPAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ WarwickPierre Jacob
 
Particle filter
Particle filterParticle filter
Particle filterbugway
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayesPhong Vo
 

Tendances (20)

Chapter 3 projection
Chapter 3 projectionChapter 3 projection
Chapter 3 projection
 
Jump-growth model for predator-prey dynamics
Jump-growth model for predator-prey dynamicsJump-growth model for predator-prey dynamics
Jump-growth model for predator-prey dynamics
 
Lecture on solving1
Lecture on solving1Lecture on solving1
Lecture on solving1
 
Lesson 16 The Spectral Theorem and Applications
Lesson 16  The Spectral Theorem and ApplicationsLesson 16  The Spectral Theorem and Applications
Lesson 16 The Spectral Theorem and Applications
 
Bayesian regression models and treed Gaussian process models
Bayesian regression models and treed Gaussian process modelsBayesian regression models and treed Gaussian process models
Bayesian regression models and treed Gaussian process models
 
Chapter 4 likelihood
Chapter 4 likelihoodChapter 4 likelihood
Chapter 4 likelihood
 
Optimalpolicyhandout
OptimalpolicyhandoutOptimalpolicyhandout
Optimalpolicyhandout
 
CMA-ES with local meta-models
CMA-ES with local meta-modelsCMA-ES with local meta-models
CMA-ES with local meta-models
 
Alternating direction
Alternating directionAlternating direction
Alternating direction
 
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trend
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future TrendCVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trend
CVPR2010: Advanced ITinCVPR in a Nutshell: part 7: Future Trend
 
Model For Estimating Diversity Presentation
Model For Estimating Diversity PresentationModel For Estimating Diversity Presentation
Model For Estimating Diversity Presentation
 
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...
M. Visinescu - Higher Order First Integrals, Killing Tensors, Killing-Maxwell...
 
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
Kernel based models for geo- and environmental sciences- Alexei Pozdnoukhov –...
 
Logit stick-breaking priors for partially exchangeable count data
Logit stick-breaking priors for partially exchangeable count dataLogit stick-breaking priors for partially exchangeable count data
Logit stick-breaking priors for partially exchangeable count data
 
Asymptotics for discrete random measures
Asymptotics for discrete random measuresAsymptotics for discrete random measures
Asymptotics for discrete random measures
 
B. Sazdovic - Noncommutativity and T-duality
B. Sazdovic - Noncommutativity and T-dualityB. Sazdovic - Noncommutativity and T-duality
B. Sazdovic - Noncommutativity and T-duality
 
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
CVPR2010: Sparse Coding and Dictionary Learning for Image Analysis: Part 1: S...
 
PAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ WarwickPAWL - GPU meeting @ Warwick
PAWL - GPU meeting @ Warwick
 
Particle filter
Particle filterParticle filter
Particle filter
 
Ml mle_bayes
Ml  mle_bayesMl  mle_bayes
Ml mle_bayes
 

Similaire à Cooperative-Lasso for sparse groups

Physics of Algorithms Talk
Physics of Algorithms TalkPhysics of Algorithms Talk
Physics of Algorithms Talkjasonj383
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesPierre Jacob
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Sangwoo Mo
 
Uncertainty in deep learning
Uncertainty in deep learningUncertainty in deep learning
Uncertainty in deep learningYujiro Katagiri
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingUSC
 
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...Ioannis Partalas
 
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7Ono Shigeru
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)NYversity
 
Basics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingBasics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingSSA KPI
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic netKyusonLim
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Valentin De Bortoli
 
A discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsA discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsLARCA UPC
 
1 hofstad
1 hofstad1 hofstad
1 hofstadYandex
 

Similaire à Cooperative-Lasso for sparse groups (20)

Physics of Algorithms Talk
Physics of Algorithms TalkPhysics of Algorithms Talk
Physics of Algorithms Talk
 
Talk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniquesTalk at CIRM on Poisson equation and debiasing techniques
Talk at CIRM on Poisson equation and debiasing techniques
 
JISA_Paper
JISA_PaperJISA_Paper
JISA_Paper
 
Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)Improved Trainings of Wasserstein GANs (WGAN-GP)
Improved Trainings of Wasserstein GANs (WGAN-GP)
 
Uncertainty in deep learning
Uncertainty in deep learningUncertainty in deep learning
Uncertainty in deep learning
 
Multitask learning for GGM
Multitask learning for GGMMultitask learning for GGM
Multitask learning for GGM
 
Integration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modelingIntegration of biological annotations using hierarchical modeling
Integration of biological annotations using hierarchical modeling
 
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...
Aggressive Sampling for Multi-class to Binary Reduction with Applications to ...
 
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
Goodfellow, Bengio, Couville (2016) "Deep Learning", Chap. 7
 
Lecture11
Lecture11Lecture11
Lecture11
 
YSC 2013
YSC 2013YSC 2013
YSC 2013
 
NTU_paper
NTU_paperNTU_paper
NTU_paper
 
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
MUMS Opening Workshop - Panel Discussion: Facts About Some Statisitcal Models...
 
Machine learning (2)
Machine learning (2)Machine learning (2)
Machine learning (2)
 
Basics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programmingBasics of probability in statistical simulation and stochastic programming
Basics of probability in statistical simulation and stochastic programming
 
Regularization and variable selection via elastic net
Regularization and variable selection via elastic netRegularization and variable selection via elastic net
Regularization and variable selection via elastic net
 
Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...Maximum likelihood estimation of regularisation parameters in inverse problem...
Maximum likelihood estimation of regularisation parameters in inverse problem...
 
A discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functionsA discussion on sampling graphs to approximate network classification functions
A discussion on sampling graphs to approximate network classification functions
 
1 hofstad
1 hofstad1 hofstad
1 hofstad
 
E212126
E212126E212126
E212126
 

Cooperative-Lasso for sparse groups

  • 1. Sparsity with sign-coherent groups of variables via the cooperative-Lasso Julien Chiquet1 , Yves Grandvalet2 , Camille Charbonnier1 1 e ´ Statistique et G´nome, CNRS & Universit´ d’Evry Val d’Essonne e 2 Heudiasyc, CNRS & Universit´ de Technologie de Compi`gne e e SSB – 29 mars 2011 arXiv preprint. http://arxiv.org/abs/1103.2697 R-package scoop. http://stat.genopole.cnrs.fr/logiciels/scoop cooperative-Lasso 1
  • 2. Notations Let Y be the output random variable, X = (X 1 , . . . , X p ) be the input random variables, where X j is the jth predictor. The data Given a sample {(yi , xi ), i = 1, . . . , n} of i.id. realizations of (Y, X), denote y = (y1 , . . . , yn ) the response vector, xj = (xj , . . . , xj ) the vector of data for the jth predictor, 1 n X the n × p design matrix of data whose jth column is xj , D = {i : (yi , xi ) ∈ training set}, T = {i : (yi , xi ) ∈ test set}. cooperative-Lasso 2
  • 3. Generalized linear models Suppose Y depends linearly on X through a function g: E(Y ) = g(Xβ ). ˆ We predict a response yi by yi = g(xi β) for any i ∈ T by solving ˆ ˆ β = arg max D (β) = arg min Lg (yi , xi β), β β i∈D where Lg is a loss function depending on the function g. Typically, if Y is Gaussian and g = Id (OLS), Lg (y, xβ) = (y − xβ)2 if Y is binary and g : t → g(t) = (1 + e−t )−1 (logistic regression) Lg (y, xβ) = − y · xβ − log 1 + exβ or any negative log-likelihood of an exponential family distribution. cooperative-Lasso 3
  • 4. Generalized linear models Suppose Y depends linearly on X through a function g: E(Y ) = g(Xβ ). ˆ We predict a response yi by yi = g(xi β) for any i ∈ T by solving ˆ ˆ β = arg max D (β) = arg min Lg (yi , xi β), β β i∈D where Lg is a loss function depending on the function g. Typically, if Y is Gaussian and g = Id (OLS), Lg (y, xβ) = (y − xβ)2 if Y is binary and g : t → g(t) = (1 + e−t )−1 (logistic regression) Lg (y, xβ) = − y · xβ − log 1 + exβ or any negative log-likelihood of an exponential family distribution. cooperative-Lasso 3
  • 5. Estimation and selection at the group level 1. Structure: the set I = {1, . . . , p} splits into a known partition. K I= Gk , with Gk ∩ G = ∅, k = . k=1 2. Sparsity: the support S of β has few entries. S = {i : βi = 0}, such as |S| p. The group-Lasso estimator Grandvalet and Canu ’98, Bakin ’99, Yuan and Lin ’06 K ˆgroup = arg min − β D (β) +λ wk β Gk . β∈Rp k=1 λ ≥ 0 controls the overall amount of penalty, wk > 0 adapts the penalty between groups (dropped hereafter). cooperative-Lasso 4
  • 6. Estimation and selection at the group level 1. Structure: the set I = {1, . . . , p} splits into a known partition. K I= Gk , with Gk ∩ G = ∅, k = . k=1 2. Sparsity: the support S of β has few entries. S = {i : βi = 0}, such as |S| p. The group-Lasso estimator Grandvalet and Canu ’98, Bakin ’99, Yuan and Lin ’06 K ˆgroup = arg min − β D (β) +λ wk β Gk . β∈Rp k=1 λ ≥ 0 controls the overall amount of penalty, wk > 0 adapts the penalty between groups (dropped hereafter). cooperative-Lasso 4
  • 7. Toy example: the prostate dataset Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients. svi lweight lcavol lcavol log(cancer volume) lweight log(prostate weight) age age coefficients lbph log(benign prostatic hyperplasia amount) svi seminal vesicle invasion lcp log(capsular penetration) lbph gleason gleason Gleason score pgg45 age pgg45 percentage Gleason scores 4 or 5 lcp -3.0 -2.5 -2.0 -1.5 -1.0 -0.5 0.0 lambda (log scale) Figure: Lasso cooperative-Lasso 5
  • 8. Toy example: the prostate dataset Examines the correlation between the prostate specific antigen and 8 clinical measures for 97 patients. 600 age 500 lcavol log(cancer volume) 400 lweight log(prostate weight) age age 300 Height lbph log(benign prostatic pgg45 200 hyperplasia amount) svi seminal vesicle invasion 100 lcp log(capsular penetration) 0 gleason Gleason score lweight gleason pgg45 percentage Gleason scores 4 lbph lcavol svi lcp or 5 Figure: hierarchical clustering cooperative-Lasso 5
• 9. Toy example: the prostate dataset (continued). Figure: group-Lasso — coefficient paths as a function of λ (log scale).
• 11. Application to splice site detection. Predict splice site status (0/1) from a sequence of 7 bases and their interactions:
order 0: 7 factors with 4 levels,
order 1: $\binom{7}{2}$ factors with $4^2$ levels,
order 2: $\binom{7}{3}$ factors with $4^3$ levels;
using dummy coding for each factor, we form groups. Figure: information content per position (sequence logo).
L. Meier, S. van de Geer, P. Bühlmann, 2008. The group-Lasso for logistic regression, JRSS series B.
• 12. Application to splice site detection (continued). Figure: estimated group coefficients for a selection of groups (g4, g5, g18, g42, g44, g45, g49, g54, g61) of orders 0, 1 and 2.
L. Meier, S. van de Geer, P. Bühlmann, 2008. The group-Lasso for logistic regression, JRSS series B.
• 13. Group-Lasso limitations
1. Not a single zero should belong to a group with non-zeros: strong group sparsity (Huang and Zhang, '10, arXiv) establishes the conditions under which the group-Lasso outperforms the Lasso, and conversely.
2. No sign-coherence within groups: required if groups gather consonant variables, e.g., groups defined by clusters of positively correlated variables.
The cooperative-Lasso: a penalty which assumes a sign-coherent group structure, that is, groups which gather either non-positive, non-negative, or null parameters.
• 15. Motivation: multiple network inference. Several experiments, each leading to its own network inference; a group is a set of corresponding edges across tasks (e.g., the red or the blue ones): sign-coherence matters!
J. Chiquet, Y. Grandvalet, C. Ambroise, 2010. Inferring multiple graphical structures, Statistics and Computing.
• 16. Motivation: joint segmentation of aCGH profiles. For a single profile y ∈ R^p of log-ratios (CNVs), segmentation solves
\[ \min_{\beta \in \mathbb{R}^p} \|\beta - y\|^2 \quad \text{s.t.} \quad \sum_{i=1}^p |\beta_i - \beta_{i-1}| < s. \]
Figure: log-ratio (CNVs) along positions on the chromosome.
• 17. Motivation: joint segmentation of aCGH profiles. For n profiles jointly,
\[ \min_{\beta \in \mathbb{R}^{n \times p}} \|\beta - Y\|^2 \quad \text{s.t.} \quad \sum_{i=1}^p \|\beta_i - \beta_{i-1}\| < s, \]
where Y is an n × p matrix gathering the n profiles of size p, and β_i is the size-n vector of the ith probes across the n profiles. A group gathers every position i across profiles: sign-coherence may avoid inconsistent variations across profiles. Figure: log-ratio (CNVs) along positions on the chromosome.
K. Bleakley and J.-P. Vert, 2010. Joint segmentation of many aCGH profiles using fast group LARS, NIPS.
• 23. Outline: Definition · Resolution · Consistency · Model selection · Simulation studies · Sibling probe sets and gene selection.
• 24. Outline (current section: Definition).
• 25. The cooperative-Lasso estimator
Definition:
\[ \hat{\beta}^{\text{coop}} = \arg\min_{\beta \in \mathbb{R}^p} J(\beta), \quad J(\beta) = -\ell_D(\beta) + \lambda \|\beta\|_{\text{coop}}, \]
where, for any v ∈ R^p,
\[ \|v\|_{\text{coop}} = \|v^+\|_{\text{group}} + \|v^-\|_{\text{group}} = \sum_{k=1}^K \left( \|v^+_{G_k}\| + \|v^-_{G_k}\| \right), \]
with v^+ = (v_1^+, …, v_p^+), v_j^+ = max(0, v_j), and v^- = (v_1^-, …, v_p^-), v_j^- = max(0, -v_j).
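As a concrete illustration, a minimal R sketch (our own code, not the scoop package API) computing the coop-norm of a vector for a given partition into groups:

## Minimal sketch (not the scoop API): coop-norm of v, groups labelled 1..K
coop_norm <- function(v, groups) {
  vplus  <- pmax(v, 0)   # positive part v+
  vminus <- pmax(-v, 0)  # negative part v-
  ## sum over groups of the Euclidean norms of each sign-part
  sum(tapply(vplus,  groups, function(x) sqrt(sum(x^2)))) +
    sum(tapply(vminus, groups, function(x) sqrt(sum(x^2))))
}
## Example with two groups of two coefficients:
coop_norm(c(1, -0.5, 2, 2), groups = c(1, 1, 2, 2))
## = ||(1,0)|| + ||(0,0.5)|| + ||(2,2)|| + ||(0,0)|| = 1 + 0.5 + 2*sqrt(2)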
• 26. A geometric view of sparsity. The penalized and constrained formulations are equivalent:
\[ \min_{\beta_1, \beta_2} -\ell(\beta_1, \beta_2) + \lambda \Omega(\beta_1, \beta_2) \quad \Longleftrightarrow \quad \max_{\beta_1, \beta_2} \ell(\beta_1, \beta_2) \ \text{ s.t. } \ \Omega(\beta_1, \beta_2) \le c. \]
Figure: level sets of the likelihood and admissible set {Ω ≤ c} in the (β1, β2) plane.
• 28. Ball crafting: group-Lasso. Admissible set for β = (β1, β2, β3, β4), with G1 = {1, 2}, G2 = {3, 4}: the unit ball ‖β‖_group ≤ 1. Figure: cross-sections of the unit ball in the (β1, β3) plane, for β2 ∈ {0, 0.3} and β4 ∈ {0, 0.3}.
• 32. Ball crafting: cooperative-Lasso. Admissible set for β = (β1, β2, β3, β4), with G1 = {1, 2}, G2 = {3, 4}: the unit ball ‖β‖_coop ≤ 1. Figure: cross-sections of the unit ball in the (β1, β3) plane, for β2 ∈ {0, 0.3} and β4 ∈ {0, 0.3}.
• 36. Outline (current section: Resolution).
• 37. Convex analysis: supporting hyperplane. A hyperplane supports a set iff (i) the set is contained in one of the half-spaces it defines and (ii) the set has at least one point on the hyperplane. Supporting hyperplanes exist at all points of a convex set: they generalize tangents. Figures: supporting hyperplanes of a convex set in the (β1, β2) plane.
• 42. Convex analysis: dual cone and subgradient, generalizing normals. g is a subgradient at x iff the vector (g, −1) is normal to the supporting hyperplane at this point. The subdifferential at x is the set of all subgradients at x. Figures: subgradients at smooth and non-smooth points in the (β1, β2) plane.
• 46. Optimality conditions
Theorem: a necessary and sufficient condition for the optimality of β is that the null vector 0 belongs to the subdifferential of the convex function J:
\[ 0 \in \partial_\beta J(\beta) = \{ v \in \mathbb{R}^p : v = -\nabla_\beta \ell_D(\beta) + \lambda \theta \}, \]
where θ ∈ R^p belongs to the subdifferential of the coop-norm. Define φ_j(v) = ‖(sign(v_j) v)^+‖; then θ is such that
\[ \forall k \in \{1, \dots, K\}, \ \forall j \in S_k(\beta): \quad \theta_j = \frac{\beta_j}{\varphi_j(\beta_{G_k})}, \]
\[ \forall k \in \{1, \dots, K\}, \ \forall j \in S_k^c(\beta): \quad \varphi_j(\theta_{G_k}) \le 1. \]
We derive a subset algorithm to solve this problem (detailed in the paper and the package).
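To make these conditions concrete, a minimal R sketch of a numeric check (our own code; we take S_k(β) to be the within-group support {j ∈ G_k : β_j ≠ 0}, and grad the gradient ∇ℓ_D(β) of the log-likelihood at β):

## Minimal KKT check (sketch): theta = grad / lambda must equal
## beta_j / phi_j(beta_Gk) on active coefficients, phi_j(theta_Gk) <= 1 elsewhere
phi <- function(v, j) sqrt(sum(pmax(sign(v[j]) * v, 0)^2))
check_kkt <- function(beta, grad, groups, lambda, tol = 1e-6) {
  theta <- grad / lambda
  all(sapply(seq_along(beta), function(j) {
    g <- which(groups == groups[j])   # indices of j's group
    jj <- which(g == j)               # position of j within the group
    if (beta[j] != 0) abs(theta[j] - beta[j] / phi(beta[g], jj)) < tol
    else phi(theta[g], jj) <= 1 + tol
  }))
}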
• 49. Linear regression with orthonormal design. Consider
\[ \hat{\beta} = \arg\min_\beta \frac{1}{2}\|y - X\beta\|^2 + \lambda \Omega(\beta), \]
with X^⊤X = I. Hence (x^j)^⊤(Xβ − y) = β_j − β̂_j^ols, and
\[ \hat{\beta} = \arg\min_\beta \frac{1}{2}\|\beta - \hat{\beta}^{\text{ols}}\|^2 + \lambda \Omega(\beta). \]
We may find a closed form of β̂ for, e.g., 1. Ω(β) = ‖β‖_lasso, 2. Ω(β) = ‖β‖_group, 3. Ω(β) = ‖β‖_coop.
• 51. Linear regression with orthonormal design: Lasso. For all j ∈ {1, …, p},
\[ \hat{\beta}_j^{\text{lasso}} = \left(1 - \frac{\lambda}{|\hat{\beta}_j^{\text{ols}}|}\right)_+ \hat{\beta}_j^{\text{ols}}, \quad \text{i.e.} \quad |\hat{\beta}_j^{\text{lasso}}| = \left(|\hat{\beta}_j^{\text{ols}}| - \lambda\right)_+. \]
Figure: Lasso as a function of the OLS coefficients.
• 52. Linear regression with orthonormal design: group-Lasso. For all k ∈ {1, …, K} and all j ∈ G_k,
\[ \hat{\beta}_j^{\text{group}} = \left(1 - \frac{\lambda}{\|\hat{\beta}_{G_k}^{\text{ols}}\|}\right)_+ \hat{\beta}_j^{\text{ols}}, \quad \text{i.e.} \quad \|\hat{\beta}_{G_k}^{\text{group}}\| = \left(\|\hat{\beta}_{G_k}^{\text{ols}}\| - \lambda\right)_+. \]
Figure: group-Lasso as a function of the OLS coefficients.
• 53. Linear regression with orthonormal design: coop-Lasso. For all k ∈ {1, …, K} and all j ∈ G_k,
\[ \hat{\beta}_j^{\text{coop}} = \left(1 - \frac{\lambda}{\varphi_j(\hat{\beta}_{G_k}^{\text{ols}})}\right)_+ \hat{\beta}_j^{\text{ols}}, \quad \text{i.e.} \quad \varphi_j(\hat{\beta}_{G_k}^{\text{coop}}) = \left(\varphi_j(\hat{\beta}_{G_k}^{\text{ols}}) - \lambda\right)_+. \]
Figure: coop-Lasso as a function of the OLS coefficients.
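These three closed forms translate directly into code; a minimal R sketch under the orthonormal-design assumption (our own helper names, not the scoop API):

## Minimal sketch assuming X'X = I: the three estimators as thresholdings
## of the OLS coefficients; 'groups' labels each coefficient
soft_lasso <- function(b_ols, lambda)
  pmax(1 - lambda / abs(b_ols), 0) * b_ols
soft_group <- function(b_ols, groups, lambda) {
  nrm <- ave(b_ols, groups, FUN = function(x) sqrt(sum(x^2)))  # ||b_Gk||
  pmax(1 - lambda / nrm, 0) * b_ols
}
soft_coop <- function(b_ols, groups, lambda) {
  phi <- function(x) {  # phi_j: norm of the sign-part matching coefficient j
    out <- ifelse(x > 0, sqrt(sum(pmax(x, 0)^2)), sqrt(sum(pmax(-x, 0)^2)))
    out[x == 0] <- Inf  # zero OLS coefficients stay at zero
    out
  }
  pmax(1 - lambda / ave(b_ols, groups, FUN = phi), 0) * b_ols
}
## Example: the lone negative coefficient is dropped, the positives cooperate
soft_coop(c(1.2, -0.3, 0.8, 0.7), groups = c(1, 1, 2, 2), lambda = 0.5)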
• 54. Outline (current section: Consistency).
• 55. Linear regression setup: technical assumptions
(A1) X and Y have finite fourth-order moments: E‖X‖⁴ < ∞, E|Y|⁴ < ∞.
(A2) The covariance matrix Ψ = E[XX^⊤] ∈ R^{p×p} is invertible.
(A3) For every k = 1, …, K, if ‖(β_{G_k})⁺‖ > 0 and ‖(β_{G_k})⁻‖ > 0, then β_j ≠ 0 for every j ∈ G_k. (All sign-coherent groups are either included in or excluded from the true support.)
• 56. Irrepresentability condition. Define S_k = S ∩ G_k, the support within a group, and
\[ [D(\beta)]_{jj} = \|[\mathrm{sign}(\beta_j)\,\beta_{G_k}]^+\|^{-1}. \]
Assume there exists η > 0 such that:
(A4) for every group G_k including at least one null coefficient,
\[ \max\left( \|(\Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta_S)\beta_S)^+\|, \ \|(\Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta_S)\beta_S)^-\| \right) \le 1 - \eta; \]
(A5) for every group G_k intersecting the support and including either positive or negative coefficients, let ν_k be the sign of these coefficients (ν_k = 1 if ‖(β_{G_k})⁺‖ > 0 and ν_k = −1 if ‖(β_{G_k})⁻‖ > 0):
\[ \nu_k \, \Psi_{S_k^c S} \Psi_{SS}^{-1} D(\beta_S)\beta_S \preceq 0, \]
where ⪯ denotes componentwise inequality.
• 57. Consistency results
Theorem: if assumptions (A1)–(A5) are satisfied with η > 0, then for every sequence λ_n = λ_0 n^{-γ}, γ ∈ (0, 1/2),
\[ \hat{\beta}^{\text{coop}} \xrightarrow{P} \beta \quad \text{and} \quad P\big(S(\hat{\beta}^{\text{coop}}) = S\big) \to 1. \]
Asymptotically, the cooperative-Lasso is unbiased and enjoys exact support recovery (even when there are irrelevant variables within a group).
• 58. Sketch of the proof
1. Construct an artificial estimator β̃_S restricted to the true support S and extend it with zero coefficients on S^c.
2. Consider the event E_n on which β̃ satisfies the original optimality conditions. On E_n, β̂^coop_S = β̃_S and β̂^coop_{S^c} = 0, by uniqueness.
3. We need to prove that lim_{n→∞} P(E_n) = 1.
4. Derive the asymptotic distribution of the derivative of the loss function, X^⊤(y − Xβ̃), from the CLT on second-order moments and the optimality conditions on β̃_S; the right choice of λ_n provides convergence in probability.
5. Assumptions (A4)–(A5) ensure that the limits in probability satisfy the optimality constraints with strict inequalities.
6. As a result, the optimality conditions are satisfied (with weak inequalities) with probability tending to 1.
• 62. Illustration. Generate data y = Xβ + σε with coefficients β = (1, 1, −1, −1, 0, 0, 0, 0), groups G = {{1, 2}, {3, 4}, {5, 6}, {7, 8}}, σ = 0.1, R² ≈ 0.99, n = 20. The irrepresentability conditions hold for the coop-Lasso but do not hold for the group-Lasso. Results are averaged over 100 simulations. Figures: solution paths as a function of log10(λ), with 50% coverage intervals (upper/lower quartiles), for the group-Lasso and the coop-Lasso.
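A minimal R sketch of this data generation (our own code; the slide does not specify the design covariance that makes (A4)–(A5) hold for the coop-Lasso and fail for the group-Lasso, so we default to a standard normal design as a placeholder):

set.seed(1)
n <- 20; p <- 8
beta_star <- c(1, 1, -1, -1, 0, 0, 0, 0)
groups    <- rep(1:4, each = 2)          # G = {{1,2},{3,4},{5,6},{7,8}}
X <- matrix(rnorm(n * p), n, p)          # placeholder design (see caveat above)
y <- X %*% beta_star + 0.1 * rnorm(n)    # sigma = 0.1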
• 65. Outline (current section: Model selection).
• 66. Optimism of the training error
The training error:
\[ \overline{\text{err}} = \frac{1}{|D|} \sum_{i \in D} L(y_i, x_i \hat{\beta}). \]
The test error ("extra-sample" error):
\[ \text{Err}_{\text{ex}} = E_{X,Y}\big[L(Y, X\hat{\beta}) \mid D\big]. \]
The "in-sample" error:
\[ \text{Err}_{\text{in}} = \frac{1}{|D|} \sum_{i \in D} E_{Y}\big[L(Y_i, x_i \hat{\beta}) \mid D\big]. \]
Definition (optimism): Err_in = err + "optimism".
• 68. Cp statistics. For squared-error loss (and some other losses),
\[ \text{Err}_{\text{in}} = \overline{\text{err}} + \frac{2}{|D|} \sum_{i \in D} \mathrm{cov}(\hat{y}_i, y_i). \]
The amount by which err underestimates the true error depends on how strongly y_i affects its own prediction: the harder we fit the data, the greater the covariance, thereby increasing the optimism (ESLII, 5th printing).
Mallows' Cp statistic: for a linear regression fit ŷ with p inputs, Σ_{i∈D} cov(ŷ_i, y_i) = pσ², so
\[ C_p = \overline{\text{err}} + 2 \, \frac{\text{df}}{|D|} \, \hat{\sigma}^2, \quad \text{with df} = p. \]
• 70. Generalized degrees of freedom. Let ŷ(λ) = Xβ̂(λ) be the predicted values for a penalized estimator.
Proposition (Efron ('04) + Stein's lemma ('81)):
\[ \text{df}(\lambda) = \frac{1}{\sigma^2} \sum_{i \in D} \mathrm{cov}(\hat{y}_i(\lambda), y_i) = E_y\left[\mathrm{tr}\left(\frac{\partial \hat{y}_\lambda}{\partial y}\right)\right]. \]
For the Lasso, Zou et al. ('07) show that df_lasso(λ) = ‖β̂^lasso(λ)‖₀. Assuming X^⊤X = I, Yuan and Lin ('06) show for the group-Lasso that the trace term equals
\[ \text{df}_{\text{group}}(\lambda) = \sum_{k=1}^K 1\big\{\|\hat{\beta}^{\text{group}}_{G_k}(\lambda)\| > 0\big\} \left(1 + (p_k - 1)\frac{\|\hat{\beta}^{\text{group}}_{G_k}(\lambda)\|}{\|\hat{\beta}^{\text{ols}}_{G_k}\|}\right). \]
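Under X^⊤X = I, Yuan and Lin's formula is straightforward to evaluate; a minimal R sketch (our own code):

## Group-Lasso df estimate (sketch): needs the fit and the OLS solution
df_group <- function(b_grp, b_ols, groups) {
  nrm_grp <- tapply(b_grp, groups, function(x) sqrt(sum(x^2)))  # ||b^group_Gk||
  nrm_ols <- tapply(b_ols, groups, function(x) sqrt(sum(x^2)))  # ||b^ols_Gk||
  pk <- as.vector(table(groups))                                # group sizes
  sum(ifelse(nrm_grp > 0, 1 + (pk - 1) * nrm_grp / nrm_ols, 0))
}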
• 73. Approximated degrees of freedom for the coop-Lasso
Proposition: assuming that data are generated according to a linear regression model and that X is orthonormal, the following expression of df_coop(λ) is an unbiased estimate of df(λ):
\[ \text{df}_{\text{coop}}(\lambda) = \sum_{k=1}^K \Bigg[ 1\big\{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^+\| > 0\big\} \left(1 + (p_k^+ - 1)\frac{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^+\|}{\|(\hat{\beta}^{\text{ols}}_{G_k})^+\|}\right) \]
\[ \qquad + 1\big\{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^-\| > 0\big\} \left(1 + (p_k^- - 1)\frac{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^-\|}{\|(\hat{\beta}^{\text{ols}}_{G_k})^-\|}\right) \Bigg], \]
where p_k^+ and p_k^- are respectively the numbers of positive and negative entries in β̂^ols_{G_k}.
• 74. Approximated degrees of freedom for the coop-Lasso (ridge-regularized variant). The same expression with the OLS reference replaced by a ridge estimate with tuning parameter γ:
\[ \text{df}_{\text{coop}}(\lambda) = \sum_{k=1}^K \Bigg[ 1\big\{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^+\| > 0\big\} \left(1 + \frac{p_k^+ - 1}{1+\gamma}\,\frac{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^+\|}{\|(\hat{\beta}^{\text{ridge}}_{G_k}(\gamma))^+\|}\right) \]
\[ \qquad + 1\big\{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^-\| > 0\big\} \left(1 + \frac{p_k^- - 1}{1+\gamma}\,\frac{\|(\hat{\beta}^{\text{coop}}_{G_k}(\lambda))^-\|}{\|(\hat{\beta}^{\text{ridge}}_{G_k}(\gamma))^-\|}\right) \Bigg], \]
where p_k^+ and p_k^- are respectively the numbers of positive and negative entries in β̂^ridge_{G_k}(γ).
• 75. Approximated information criteria. Following Zou et al., we extend the Cp statistic to an "approximated" AIC,
\[ \text{AIC}(\lambda) = \frac{\|y - \hat{y}(\lambda)\|^2}{\sigma^2} + 2\,\widetilde{\text{df}}(\lambda), \]
and from the AIC it is a small step to the BIC:
\[ \text{BIC}(\lambda) = \frac{\|y - \hat{y}(\lambda)\|^2}{\sigma^2} + \log(n)\,\widetilde{\text{df}}(\lambda). \]
K-fold cross-validation works well but is computationally intensive; it is required when we do not meet the linear regression setup.
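Given fitted values along a path and the df estimates above, the criteria are one-liners; a minimal R sketch (our own helper, not the scoop API):

## Approximated AIC/BIC along a path of fits (sketch)
ic_path <- function(y, yhat, df, sigma2) {
  ## yhat: n x nlambda matrix of fits; df: df estimate for each lambda
  rss <- colSums((y - yhat)^2)
  list(AIC = rss / sigma2 + 2 * df,
       BIC = rss / sigma2 + log(length(y)) * df)
}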
• 76. Outline (current section: Simulation studies).
• 77. Revisiting Elastic-Net experiments (1). Generate data y = Xβ + σε with
β = (0, …, 0, 2, …, 2, 0, …, 0, 2, …, 2) (four blocks of 10),
G1 = {1, …, 10}, G2 = {11, …, 20}, G3 = {21, …, 30}, G4 = {31, …, 40},
σ = 15, corr(x_i, x_j) = 0.5, training/validation/test = 100/100/400, averaged over 100 simulations.
Figure: boxplots of the test MSE for the lasso, elastic-net, group-Lasso and coop-Lasso.
• 78. Revisiting Elastic-Net experiments (2). Generate data y = Xβ + σε with β = (3, …, 3, 0, …, 0) (15 threes, 25 zeros), σ = 15, groups G1 = {1, …, 5}, G2 = {6, …, 10}, G3 = {11, …, 15}, G4 = {16, …, 40}, and
x^j = Z1 + ε, Z1 ~ N(0, 1), ∀j ∈ G1,
x^j = Z2 + ε, Z2 ~ N(0, 1), ∀j ∈ G2,
x^j = Z3 + ε, Z3 ~ N(0, 1), ∀j ∈ G3,
x^j ~ N(0, 1), ∀j ∈ G4.
Training/validation/test = 50/50/400, averaged over 100 simulations.
Figure: boxplots of the test MSE for the lasso, elastic-net, group-Lasso and coop-Lasso.
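A minimal R sketch of this second design (our own code; the slide does not give the scale of the within-block noise ε, so 0.01 below is a placeholder):

set.seed(1)
n <- 50
beta_star <- c(rep(3, 15), rep(0, 25))
Z <- matrix(rnorm(n * 3), n, 3)                             # one latent factor per block
X <- cbind(Z[, rep(1:3, each = 5)] + 0.01 * rnorm(n * 15),  # G1, G2, G3
           matrix(rnorm(n * 25), n, 25))                    # G4: pure noise
y <- X %*% beta_star + 15 * rnorm(n)                        # sigma = 15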
• 79. Breiman's setup. Simulation setting: a wave-like vector of parameters β, with p = 90 variables partitioned into K = 10 groups of size p_k = 9; 3 (partially) active groups and 7 groups of zeros; in active groups, β_j ∝ (h − |5 − j|)_+ with h = 1, …, 5. Figures: β for h = 1, …, 5, giving |S_k| = 2h − 1 non-zero coefficients in each active group.
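A minimal R sketch of the wave-like coefficient vector (our own code; the positive part matches the |S_k| = 2h − 1 counts in the figure captions):

## Breiman's wave-like beta: K groups of size pk, the first n_active are active
make_beta <- function(h, K = 10, pk = 9, n_active = 3) {
  wave <- pmax(h - abs(5 - seq_len(pk)), 0)  # triangular bump peaking at j = 5
  c(rep(wave, n_active), rep(0, (K - n_active) * pk))
}
beta_star <- make_beta(h = 3)  # |S_k| = 5 non-zeros in each active group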
• 84. Breiman's setup: example of solution paths and signal recovery with the BIC choice. The signal is generated as y = Xβ + σε with σ = 1, n = 30 to 500, rows of X drawn from N(0, Ψ) with Ψ_ij = ρ^{|i−j|} (ρ = 0.4 in the example), and the magnitude of β chosen so that R² ≈ 0.75.
Remark: the covariance structure is purposely disconnected from the group structure; none of the support recovery conditions are fulfilled.
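A minimal R sketch of this design (our own code, using beta_star from the sketch above; the rescaling of β to reach R² ≈ 0.75 is omitted):

set.seed(1)
n <- 120; p <- 90; rho <- 0.4
Psi <- rho^abs(outer(1:p, 1:p, "-"))           # Psi_ij = rho^|i-j|
X <- matrix(rnorm(n * p), n, p) %*% chol(Psi)  # rows ~ N(0, Psi)
y <- X %*% beta_star + rnorm(n)                # sigma = 1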
• 85–88. Breiman's setup (continued): a one-shot sample with n = 120. Figures: solution paths as a function of log10(λ), and true vs. estimated signal at the BIC choice, for the Lasso, the group-Lasso and the coop-Lasso.
• 89. Breiman's setup: errors as a function of the sample size n. Figure: prediction error and sign error vs. n, for h = 3, |S_k| = 5 (favoring the Lasso); methods: lasso, group-Lasso, coop-Lasso.
• 90. Breiman's setup: errors as a function of the sample size n. Figure: prediction error and sign error vs. n, for h = 4, |S_k| = 7 (intermediate); methods: lasso, group-Lasso, coop-Lasso.