Random forests

Gérard Biau

Groupe de travail prévision (forecasting working group), May 2011
Outline

1   Setting

2   A random forests model

3   Layered nearest neighbors and random forests
1   Setting
Leo Breiman (1928-2005)
Classification

[Illustration: a labelled training sample and a new query point "?" whose class has to be predicted.]
Regression

[Illustration: training points carrying real-valued responses; the goal is to predict the response at a new query point "?", estimated here as 2.25.]
Trees

[Illustrations: a tree recursively partitions the feature space into rectangular cells; the prediction at the query point "?" is computed from the observations falling in its cell.]
From trees to forests

   Leo Breiman promoted random forests.

   Idea: use tree averaging as a means of obtaining good rules.

   The base trees are simple and randomized.

Breiman's ideas were decisively influenced by

   Amit and Geman (1997, geometric feature selection),

   Ho (1998, the random subspace method),

   Dietterich (2000, the random split selection approach).

   stat.berkeley.edu/users/breiman/RandomForests
Random forests

   They have emerged as serious competitors to state-of-the-art
   methods.

   They are fast and easy to implement, produce highly accurate
   predictions, and can handle a very large number of input variables
   without overfitting.

   In fact, forests are among the most accurate general-purpose
   learners available.

   The algorithm is difficult to analyze, and its mathematical
   properties remain to date largely unknown.

   Most theoretical studies have concentrated on isolated parts or
   stylized versions of the procedure.
Key references

   Breiman (2000, 2001, 2004).
         Definition, experiments and intuitions.

   Lin and Jeon (2006).
         Link with layered nearest neighbors.

   Biau, Devroye and Lugosi (2008).
         Consistency results for stylized versions.

   Biau (2010).
         Sparsity and random forests.
2   A random forests model
Three basic ingredients

1 - Randomization and no pruning
   For each tree, select at random, at each node, a small group of
   input coordinates to split on.

   Calculate the best split based on these features and cut.

   The tree is grown to maximum size, without pruning.
Three basic ingredients

2 - Aggregation
   Final predictions are obtained by aggregating over the ensemble.

   It is fast and easily parallelizable.
Three basic ingredients

3 - Bagging
   The subspace randomization scheme is blended with bagging.

   Breiman (1996).

   Bühlmann and Yu (2002).

   Biau, Cérou and Guyader (2010).
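As an aside that is not part of the original slides: the three ingredients above (per-node random feature selection without pruning, aggregation by averaging, and bagging) are exactly what standard implementations expose. A minimal, hedged scikit-learn sketch, with purely illustrative parameter values, could look like this.

```python
# Hedged sketch (not the authors' code): the three ingredients as exposed by
# scikit-learn's RandomForestRegressor. All parameter values are illustrative.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Synthetic sparse regression problem: d = 20 features, 5 of them informative.
X, y = make_regression(n_samples=1000, n_features=20, n_informative=5,
                       noise=1.0, random_state=0)

forest = RandomForestRegressor(
    n_estimators=200,      # number of randomized base trees
    max_features="sqrt",   # random group of candidate coordinates at each node
    bootstrap=True,        # bagging: each tree sees a bootstrap sample of the data
    n_jobs=-1,             # trees are grown independently, hence easily parallelized
    random_state=0,
)
forest.fit(X, y)           # trees are grown to maximal depth by default (no pruning)
print(forest.predict(X[:3]))
```

Here max_features plays the role of the small random group of candidate coordinates, and bootstrap=True provides the bagging layer.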
Mathematical framework

   A training sample: D_n = {(X_1, Y_1), ..., (X_n, Y_n)}, i.i.d.
   [0, 1]^d × R-valued random variables.

   A generic pair: (X, Y) satisfying E Y^2 < ∞.

   Our mission: for fixed x ∈ [0, 1]^d, estimate the regression function
   r(x) = E[Y | X = x] using the data.

   Quality criterion: E[r_n(X) − r(X)]^2.
The model

   A random forest is a collection of randomized base regression
   trees {r_n(x, Θ_m, D_n), m ≥ 1}.

   These random trees are combined to form the aggregated
   regression estimate

$$\bar{r}_n(\mathbf{X}, \mathcal{D}_n) = \mathbb{E}_\Theta\big[r_n(\mathbf{X}, \Theta, \mathcal{D}_n)\big].$$

   Θ is assumed to be independent of X and the training sample D_n.

   However, we allow Θ to be based on a test sample, independent
   of, but distributed as, D_n.
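In practice (a clarifying remark that is not on the original slide), the expectation E_Θ is approximated by a Monte Carlo average over a finite number M of independent draws Θ_1, ..., Θ_M of the randomizing variable:

$$\bar{r}_n(\mathbf{X}, \mathcal{D}_n) \approx \frac{1}{M}\sum_{m=1}^{M} r_n(\mathbf{X}, \Theta_m, \mathcal{D}_n).$$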
The procedure

   Fix k_n ≥ 2 and repeat the following procedure ⌈log_2 k_n⌉ times:

 1   At each node, a coordinate of X = (X^(1), ..., X^(d)) is selected, with the
     j-th feature having a probability p_nj ∈ (0, 1) of being selected.

 2   At each node, once the coordinate is selected, the split is at the midpoint
     of the chosen side.

   Thus

$$\bar{r}_n(\mathbf{X}) = \mathbb{E}_\Theta\!\left[\frac{\sum_{i=1}^n Y_i\, \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]}}{\sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]}}\, \mathbf{1}_{\mathcal{E}_n(\mathbf{X},\Theta)}\right],$$

   where A_n(X, Θ) is the cell of the resulting random partition that contains X and

$$\mathcal{E}_n(\mathbf{X}, \Theta) = \Big[\textstyle\sum_{i=1}^n \mathbf{1}_{[\mathbf{X}_i \in A_n(\mathbf{X},\Theta)]} \neq 0\Big].$$
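A minimal sketch of this toy model (my own illustration, not the authors' code): here every coordinate is chosen with probability p_nj = 1/d, as in the purely random variant discussed below, and E_Θ is replaced by an average over a finite number of draws of Θ.

```python
# Hedged sketch of the purely random midpoint-split model described above.
# Assumptions: p_nj = 1/d for every coordinate, and E_Theta is approximated
# by an average over n_trees independent draws of Theta.
import math
import numpy as np

def cell_mean(x, X, Y, k_n, rng):
    """One randomized tree r_n(x, Theta, D_n): follow x down ceil(log2 k_n)
    midpoint splits and return the mean response inside its terminal cell."""
    lower = np.zeros(X.shape[1])
    upper = np.ones(X.shape[1])
    for _ in range(math.ceil(math.log2(k_n))):
        j = rng.integers(X.shape[1])       # coordinate selected with probability 1/d
        mid = 0.5 * (lower[j] + upper[j])  # split at the midpoint of the chosen side
        if x[j] < mid:
            upper[j] = mid
        else:
            lower[j] = mid
    in_cell = np.all((X >= lower) & (X < upper), axis=1)
    return Y[in_cell].mean() if in_cell.any() else 0.0   # the indicator 1_{E_n}

def forest_estimate(x, X, Y, k_n, n_trees=100, seed=0):
    rng = np.random.default_rng(seed)
    return np.mean([cell_mean(x, X, Y, k_n, rng) for _ in range(n_trees)])

# Toy example on [0, 1]^2 with r(x) = x1 + x2.
rng = np.random.default_rng(1)
X = rng.random((2000, 2))
Y = X[:, 0] + X[:, 1] + 0.1 * rng.standard_normal(2000)
print(forest_estimate(np.array([0.3, 0.7]), X, Y, k_n=32))
```

With k_n = 32, the cell of the query point results from ⌈log_2 32⌉ = 5 midpoint splits.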
Binary trees

[Illustrations: the binary trees associated with the successive midpoint splits.]
General comments

   Each individual tree has exactly 2^⌈log_2 k_n⌉ (≈ k_n) terminal nodes,
   and each leaf has Lebesgue measure 2^−⌈log_2 k_n⌉ (≈ 1/k_n).

   If X has the uniform distribution on [0, 1]^d, there will be on average
   about n/k_n observations per terminal node.

   The choice k_n = n induces a very small number of cases in the
   final leaves.

This scheme is close to what the original random forests algorithm does.
Consistency

Theorem
Assume that the distribution of X has support on [0, 1]^d. Then the
random forests estimate $\bar{r}_n$ is consistent whenever p_nj log k_n → ∞ for
all j = 1, ..., d and k_n/n → 0 as n → ∞.

   In the purely random model, p_nj = 1/d, independently of n and j,
   and consistency is ensured as long as k_n → ∞ and k_n/n → 0.

   This is, however, a radically simplified version of the random forests
   used in practice.

   A more in-depth analysis is needed.
Sparsity

   There is empirical evidence that many signals in high-dimensional
   spaces admit a sparse representation.
         Wavelet coefficients of images.
         High-throughput technologies.

   Sparse estimation is playing an increasingly important role in the
   statistics and machine learning communities.

   Several methods relying on the notion of sparsity have recently been
   developed in both fields.
Our vision

   The regression function r(X) = E[Y | X] depends in fact only on a
   nonempty subset 𝒮 (for Strong) of the d features.

   In other words, letting X_𝒮 = (X_j : j ∈ 𝒮) and S = Card 𝒮, we have

      r(X) = E[Y | X_𝒮].

   In the dimension reduction scenario we have in mind, the ambient
   dimension d can be very large, much larger than n.

   As such, the value S characterizes the sparsity of the model: the
   smaller S, the sparser r.
Sparsity and random forests

   Ideally, p_nj = 1/S for j ∈ 𝒮.

   To stick to reality, we will rather require that p_nj = (1/S)(1 + ξ_nj).

   Such a randomization mechanism may be designed on the basis
   of a test sample.

Action plan
Decompose the error into an estimation term and an approximation term:

$$\mathbb{E}[\bar r_n(\mathbf{X}) - r(\mathbf{X})]^2 = \mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2 + \mathbb{E}[\tilde r_n(\mathbf{X}) - r(\mathbf{X})]^2,$$

where

$$\tilde r_n(\mathbf{X}) = \sum_{i=1}^n \mathbb{E}_\Theta[W_{ni}(\mathbf{X}, \Theta)]\, r(\mathbf{X}_i)$$

and the W_ni(X, Θ) denote the weights of the individual tree estimate.
Variance

Proposition
Assume that X is uniformly distributed on [0, 1]^d and, for all x ∈ R^d,

$$\sigma^2(\mathbf{x}) = \mathbb{V}[Y \mid \mathbf{X} = \mathbf{x}] \le \sigma^2$$

for some positive constant σ². Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮,

$$\mathbb{E}[\bar r_n(\mathbf{X}) - \tilde r_n(\mathbf{X})]^2 \le C\sigma^2 \left(\frac{S^2}{S-1}\right)^{S/2d} (1+\xi_n)\, \frac{k_n}{n\,(\log k_n)^{S/2d}},$$

where

$$C = \frac{288}{\pi}\left(\frac{\pi \log 2}{16}\right)^{S/2d}.$$
Bias

Proposition
Assume that X is uniformly distributed on [0, 1]^d and r is L-Lipschitz.
Then, if p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮,

$$\mathbb{E}[\tilde r_n(\mathbf{X}) - r(\mathbf{X})]^2 \le \frac{2SL^2}{k_n^{\frac{0.75}{S\log 2}(1+\gamma_n)}} + \left[\sup_{\mathbf{x}\in[0,1]^d} r^2(\mathbf{x})\right] e^{-n/2k_n}.$$

   The rate at which the bias decreases to 0 depends on the number
   of strong variables, not on d.

   $k_n^{-\frac{0.75}{S\log 2}(1+\gamma_n)} = o\big(k_n^{-2/d}\big)$ as soon as S ≤ 0.54 d.

   The term e^{−n/2k_n} prevents the extreme choice k_n = n.
Main result

Theorem
If p_nj = (1/S)(1 + ξ_nj) for j ∈ 𝒮, with ξ_nj log n → 0 as n → ∞, then for
the choice

$$k_n \propto \left(\frac{L^2}{\Xi}\right)^{\frac{1}{1+\frac{0.75}{S\log 2}}} n^{\frac{1}{1+\frac{0.75}{S\log 2}}},$$

we have

$$\limsup_{n\to\infty}\ \sup_{(\mathbf{X},Y)\in\mathcal{F}}\ \frac{\mathbb{E}[\bar r_n(\mathbf{X}) - r(\mathbf{X})]^2}{\Xi^{\frac{0.75}{S\log 2 + 0.75}}\, L^{\frac{2S\log 2}{S\log 2 + 0.75}}\, n^{\frac{-0.75}{S\log 2 + 0.75}}} \le \Lambda.$$

Take-home message
The rate $n^{-0.75/(S\log 2 + 0.75)}$ is strictly faster than the usual minimax rate
$n^{-2/(d+2)}$ as soon as S ≤ 0.54 d.
Dimension reduction

[Plot: a rate exponent plotted against the number of strong variables S = 2, ..., 100, with the level 0.0196 marked; this level is consistent with the minimax exponent 2/(d + 2) for d = 100.]
Discussion — Bagging

   The optimal parameter k_n depends on the unknown distribution of
   (X, Y).

   To correct this situation, adaptive choices of k_n should preserve
   the rate of convergence of the estimate.

   Another route we may follow is to analyze the effect of bagging.

   Biau, Cérou and Guyader (2010).
Discussion — Choosing the p_nj's

Imaginary scenario
The following splitting scheme is iteratively repeated at each node:
 1   Select at random M_n candidate coordinates to split on.
 2   If the selection is all weak, then choose one at random to split on.
 3   If there is more than one strong variable elected, choose one at
     random and cut.
Discussion — Choosing the p_nj's

   Each coordinate in 𝒮 will be cut with the "ideal" probability

$$p_n = \frac{1}{S}\left[1 - \left(1 - \frac{S}{d}\right)^{M_n}\right].$$

   The parameter M_n should satisfy

$$\left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0 \quad\text{as } n \to \infty.$$

   This is true as soon as

$$M_n \to \infty \quad\text{and}\quad \frac{M_n}{\log n} \to \infty \quad\text{as } n \to \infty.$$
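As a quick numerical illustration (not on the slides; the values of S, d and M_n are arbitrary), the ideal probability above approaches 1/S once M_n is large compared with d/S:

```python
# Hedged numerical check of p_n = (1/S) * (1 - (1 - S/d)**M_n) for
# illustrative values of S, d and M_n (these values are not from the slides).
S, d = 5, 100
for M_n in (1, 10, 50, 200, 1000):
    p_n = (1.0 / S) * (1.0 - (1.0 - S / d) ** M_n)
    print(f"M_n = {M_n:4d}:  p_n = {p_n:.4f}   (target 1/S = {1 / S:.4f})")
```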
Assumptions

   We have at hand an independent test set (independent of, but
   distributed as, D_n).

   The model is linear:

$$Y = \sum_{j \in \mathcal{S}} a_j X^{(j)} + \varepsilon.$$

   For a fixed node A = ∏_{j=1}^d A_j, fix a coordinate j and look at the
   weighted conditional variance V[Y | X^(j) ∈ A_j] · P(X^(j) ∈ A_j).

   If j ∈ 𝒮, then the best split is at the midpoint of the node, with a
   variance decrease equal to a_j²/16 > 0.

   If j ∈ 𝒲 (the set of weak variables), the decrease of the variance is
   always 0, whatever the location of the split.
Discussion — Choosing the p_nj's

Near-reality scenario
The following splitting scheme is iteratively repeated at each node:
 1   Select at random M_n candidate coordinates to split on.
 2   For each of the M_n elected coordinates, calculate the best split.
 3   Select the coordinate which yields the largest within-node sum-of-squares
     decrease, and cut.

Conclusion
For j ∈ 𝒮,

$$p_{nj} \approx \frac{1}{S}\left(1 + \xi_{nj}\right),$$

where ξ_nj → 0 and satisfies the constraint ξ_nj log n → 0 as n tends to
infinity, provided k_n log n/n → 0, M_n → ∞ and M_n/log n → ∞.
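A minimal sketch of this splitting step (my own illustration, not the authors' code): among M_n randomly drawn candidate coordinates, choose the coordinate and cut point giving the largest decrease of the within-node sum of squares.

```python
# Hedged sketch of the "near-reality" splitting step: among M_n random
# candidate coordinates, choose the (coordinate, threshold) pair that most
# decreases the within-node sum of squares. Names are illustrative.
import numpy as np

def sse(y):
    return float(((y - y.mean()) ** 2).sum()) if y.size else 0.0

def best_random_split(X_node, Y_node, M_n, rng):
    d = X_node.shape[1]
    candidates = rng.choice(d, size=min(M_n, d), replace=False)
    best = None  # (decrease, coordinate, threshold)
    for j in candidates:
        for t in np.unique(X_node[:, j])[:-1]:       # candidate cut points
            left = X_node[:, j] <= t
            decrease = sse(Y_node) - sse(Y_node[left]) - sse(Y_node[~left])
            if best is None or decrease > best[0]:
                best = (decrease, j, t)
    return best

rng = np.random.default_rng(0)
X = rng.random((200, 10))
Y = 2.0 * X[:, 3] + 0.1 * rng.standard_normal(200)   # coordinate 3 is "strong"
print(best_random_split(X, Y, M_n=3, rng=rng))
```

In the linear model of the previous slide, a strong coordinate tends to win whenever it appears among the M_n candidates, which is what drives p_nj ≈ (1/S)(1 + ξ_nj).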
3   Layered nearest neighbors and random forests
Layered Nearest Neighbors

Definition
Let X_1, ..., X_n be a sample of i.i.d. random vectors in R^d, d ≥ 2. An
observation X_i is said to be a layered nearest neighbor (LNN) of a point x if
the hyperrectangle defined by x and X_i contains no other data point.

[Illustration: the hyperrectangle with opposite corners x and X_i is empty.]
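A small sketch of this definition (my own illustration): X_i is an LNN of x when no other X_k lies inside the closed hyperrectangle with opposite corners x and X_i.

```python
# Hedged sketch: compute the layered nearest neighbors (LNN) of a query
# point x by brute force, directly from the definition above.
import numpy as np

def lnn_indices(x, X):
    """Indices i such that the hyperrectangle defined by x and X[i]
    contains no other data point."""
    lo = np.minimum(x, X)            # lower corners of the rectangles, one per X[i]
    hi = np.maximum(x, X)            # upper corners
    keep = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False            # X[i] itself does not count
        if not inside.any():
            keep.append(i)
    return np.array(keep)

rng = np.random.default_rng(0)
X = rng.random((500, 2))
x = np.array([0.5, 0.5])
print(len(lnn_indices(x, X)), "layered nearest neighbors of x")
```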
What is known about L_n(x)?

   ... a lot when X_1, ..., X_n are uniformly distributed over [0, 1]^d.
   (Here L_n(x) denotes the number of layered nearest neighbors of x in the sample.)

   For example,

$$\mathbb{E}\,L_n(\mathbf{x}) = \frac{2^d (\log n)^{d-1}}{(d-1)!} + O\big((\log n)^{d-2}\big)$$

   and

$$\frac{(d-1)!\, L_n(\mathbf{x})}{2^d (\log n)^{d-1}} \to 1 \quad\text{in probability as } n \to \infty.$$

   This is the problem of maxima in random vectors
   (Barndorff-Nielsen and Sobel, 1966).
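As a worked illustration that is not on the slides (taking log as the natural logarithm), in dimension d = 2 the leading term is 2² (log n)¹ / 1! = 4 log n, so for n = 10⁶ uniform points one expects roughly 4 log(10⁶) ≈ 55 layered nearest neighbors of x.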
Two results (Biau and Devroye, 2010)
Model
X1 , . . . , Xn are independently distributed according to some probability
density f (with probability measure µ).

Theorem
For µ-almost all x ∈ Rd , one has

                       Ln (x) → ∞ in probability as n → ∞.

Theorem
Suppose that f is λ-almost everywhere continuous. Then

                         (d − 1)! ELn (x) / (2^d (log n)^(d−1)) → 1 as n → ∞,

at µ-almost all x ∈ Rd .
      G. Biau (UPMC)                                                   101 / 106
LNN regression estimation

Model
(X, Y ), (X1 , Y1 ), . . . , (Xn , Yn ) are i.i.d. random vectors of Rd × R.
Moreover, |Y | is bounded and X has a density.

The regression function r (x) = E[Y |X = x] may be estimated by
                        rn (x) = (1/Ln (x)) Σ_{i=1}^n Yi 1[Xi ∈ Ln (x)] .


  1   No smoothing parameter.
  2   A scale-invariant estimate.
  3   Intimately connected to Breiman’s random forests.

       G. Biau (UPMC)                                                          102 / 106
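A direct sketch of the estimate, reusing the brute-force LNN search (the toy target r(x) = x1 x2, the bounded uniform noise and the function name are mine; agreement with r(x) is only approximate at moderate n, the effective sample size being of order (log n)^(d−1)).

```python
import numpy as np

def lnn_regress(X, y, x):
    """LNN regression estimate: the average of Yi over the layered nearest
    neighbors of x. No smoothing parameter, and invariant under coordinatewise
    strictly increasing rescalings of X."""
    lo, hi = np.minimum(X, x), np.maximum(X, x)
    inside = np.all((X[None] >= lo[:, None]) & (X[None] <= hi[:, None]), axis=2)
    np.fill_diagonal(inside, False)
    lnn = ~inside.any(axis=1)                # boolean mask of the LNNs of x
    return float(y[lnn].mean())

# toy example: r(x) = x1 * x2 on [0, 1]^2, bounded noisy responses
rng = np.random.default_rng(0)
X = rng.uniform(size=(2000, 2))
y = X[:, 0] * X[:, 1] + 0.05 * rng.uniform(-1, 1, size=2000)
x = np.array([0.4, 0.7])
print(lnn_regress(X, y, x), "vs true value", x[0] * x[1])
```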
Consistency
Theorem (Pointwise Lp -consistency)
Assume that the regression function r is λ-almost everywhere
continuous and that Y is bounded. Then, for µ-almost all x ∈ Rd and
all p ≥ 1,
                  E |rn (x) − r (x)|p → 0 as n → ∞.

Theorem (Global Lp -consistency)
Under the same conditions, for all p ≥ 1,

                       E |rn (X) − r (X)|p → 0 as n → ∞.

 1   No universal consistency result with respect to r is possible.
 2   The results do not impose any condition on the density.
 3   They are also scale-free.
      G. Biau (UPMC)                                                  103 / 106
Back to random forests



A random forest can be viewed as a weighted LNN regression estimate
                       r̄n (x) = Σ_{i=1}^n Yi Wni (x),

where the weights concentrate on the LNN and satisfy

                       Σ_{i=1}^n Wni (x) = 1.




     G. Biau (UPMC)                                            104 / 106
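To make the weighted-average form concrete, the sketch below computes Monte Carlo weights Wni(x) for the purely random, midpoint-split trees used earlier in the talk as a toy model, not for Breiman's data-dependent splits. It only illustrates the representation r̄n(x) = Σ Yi Wni(x) with weights summing to (essentially) one; the concentration of the weights on the LNN is a property of fully grown forests with data-dependent splits (Lin and Jeon, 2006) and is not reproduced by this toy model. All names are illustrative.

```python
import numpy as np

def random_cell(x, depth, rng, d):
    """Cell containing x in a purely random tree on [0,1]^d: at each of
    `depth` levels, pick a coordinate at random and cut at the midpoint."""
    lo, hi = np.zeros(d), np.ones(d)
    for _ in range(depth):
        j = rng.integers(d)
        mid = 0.5 * (lo[j] + hi[j])
        if x[j] < mid:
            hi[j] = mid
        else:
            lo[j] = mid
    return lo, hi

def forest_weights(X, x, n_trees, depth, rng):
    """Monte Carlo approximation of Wni(x) = E_Theta[1[Xi in An(x,Theta)] / Nn(x,Theta)]."""
    W = np.zeros(len(X))
    for _ in range(n_trees):
        lo, hi = random_cell(x, depth, rng, X.shape[1])
        in_cell = np.all((X >= lo) & (X < hi), axis=1)
        if in_cell.any():                    # the 1_{E_n} indicator: skip empty cells
            W += in_cell / in_cell.sum()
    return W / n_trees

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 2))
y = np.sin(4 * X[:, 0]) + X[:, 1]
x = np.array([0.3, 0.6])
W = forest_weights(X, x, n_trees=200, depth=6, rng=rng)
print(W.sum(), (W > 0).sum(), float(W @ y))  # ~1, number of contributing points, r̄n(x)
```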
Non-adaptive strategies


Consider the non-adaptive random forests estimate
                         r̄n (x) = Σ_{i=1}^n Yi Wni (x),

where the weights concentrate on the LNN.
Proposition
For any x ∈ Rd , assume that σ^2 = V[Y |X = x] is independent of x.
Then
                         E [r̄n (x) − r (x)]^2 ≥ σ^2 / ELn (x).




     G. Biau (UPMC)                                                   105 / 106
Rate of convergence



At µ-almost all x, when f is λ-almost everywhere continuous,

                       E [r̄n (x) − r (x)]^2  ≳  σ^2 (d − 1)! / (2^d (log n)^(d−1)).

Improving the rate of convergence
 1   Stop as soon as a future rectangle split would cause a
     sub-rectangle to have fewer than kn points.
 2   Resort to bagging and randomize using random subsamples.




      G. Biau (UPMC)                                            106 / 106
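A literal reading of these two suggestions can be sketched as follows; this is only one plausible interpretation of the slide, with midpoint cuts, illustrative names, and a random subsample of half the data standing in for bagging.

```python
import numpy as np

def grow_cells(X_node, lo, hi, k_n, rng):
    """Recursively split a rectangle at the midpoint of a random coordinate,
    but stop as soon as the split would leave a child rectangle with fewer
    than k_n points (item 1 above). Returns the list of leaf rectangles."""
    j = rng.integers(lo.size)
    mid = 0.5 * (lo[j] + hi[j])
    left = X_node[:, j] < mid
    if left.sum() < k_n or (~left).sum() < k_n:
        return [(lo.copy(), hi.copy())]          # a child would be under-populated: stop
    hi_left, lo_right = hi.copy(), lo.copy()
    hi_left[j] = mid
    lo_right[j] = mid
    return (grow_cells(X_node[left], lo, hi_left, k_n, rng)
            + grow_cells(X_node[~left], lo_right, hi, k_n, rng))

rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 2))
sub = rng.choice(len(X), size=len(X) // 2, replace=False)    # item 2: random subsample
leaves = grow_cells(X[sub], np.zeros(2), np.ones(2), k_n=20, rng=rng)
print(len(leaves), "leaves, each containing at least 20 of the subsampled points")
```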

Contenu connexe

En vedette

Prediction in dynamic Graphs
Prediction in dynamic GraphsPrediction in dynamic Graphs
Prediction in dynamic GraphsCdiscount
 
Scm prix blé_2012_11_06
Scm prix blé_2012_11_06Scm prix blé_2012_11_06
Scm prix blé_2012_11_06Cdiscount
 
Scm indicateurs prospectifs_2012_11_06
Scm indicateurs prospectifs_2012_11_06Scm indicateurs prospectifs_2012_11_06
Scm indicateurs prospectifs_2012_11_06Cdiscount
 
State Space Model
State Space ModelState Space Model
State Space ModelCdiscount
 
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...Cdiscount
 
Robust sequentiel learning
Robust sequentiel learningRobust sequentiel learning
Robust sequentiel learningCdiscount
 
Paris2012 session2
Paris2012 session2Paris2012 session2
Paris2012 session2Cdiscount
 
Paris2012 session3b
Paris2012 session3bParis2012 session3b
Paris2012 session3bCdiscount
 
Paris2012 session4
Paris2012 session4Paris2012 session4
Paris2012 session4Cdiscount
 
Paris2012 session1
Paris2012 session1Paris2012 session1
Paris2012 session1Cdiscount
 
Prévisions trafic aérien
Prévisions trafic aérienPrévisions trafic aérien
Prévisions trafic aérienCdiscount
 
Ana maria araujo calderon trabajo social redvolucion leidis
Ana maria araujo calderon trabajo social redvolucion leidisAna maria araujo calderon trabajo social redvolucion leidis
Ana maria araujo calderon trabajo social redvolucion leidisRedvolucionCesarNorte
 
Lean Startup for AaltoES Summer of Startups
Lean Startup for AaltoES Summer of StartupsLean Startup for AaltoES Summer of Startups
Lean Startup for AaltoES Summer of StartupsMarko Taipale
 
Slide 04 - Otaciso, Digelvânia e Silvana
Slide 04 - Otaciso, Digelvânia e SilvanaSlide 04 - Otaciso, Digelvânia e Silvana
Slide 04 - Otaciso, Digelvânia e Silvanarafaelly04
 
Spring 2011 Current Projects
Spring 2011 Current ProjectsSpring 2011 Current Projects
Spring 2011 Current Projectsgrantml
 
3 ways to value your business
3 ways to value your business3 ways to value your business
3 ways to value your businessEquidam
 
JavaScript Unit Testing
JavaScript Unit TestingJavaScript Unit Testing
JavaScript Unit TestingKeir Bowden
 

En vedette (20)

Prediction in dynamic Graphs
Prediction in dynamic GraphsPrediction in dynamic Graphs
Prediction in dynamic Graphs
 
Scm prix blé_2012_11_06
Scm prix blé_2012_11_06Scm prix blé_2012_11_06
Scm prix blé_2012_11_06
 
Scm indicateurs prospectifs_2012_11_06
Scm indicateurs prospectifs_2012_11_06Scm indicateurs prospectifs_2012_11_06
Scm indicateurs prospectifs_2012_11_06
 
State Space Model
State Space ModelState Space Model
State Space Model
 
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...
Prediction of Quantiles by Statistical Learning and Application to GDP Foreca...
 
Robust sequentiel learning
Robust sequentiel learningRobust sequentiel learning
Robust sequentiel learning
 
Paris2012 session2
Paris2012 session2Paris2012 session2
Paris2012 session2
 
Scm risques
Scm risquesScm risques
Scm risques
 
Paris2012 session3b
Paris2012 session3bParis2012 session3b
Paris2012 session3b
 
Paris2012 session4
Paris2012 session4Paris2012 session4
Paris2012 session4
 
Paris2012 session1
Paris2012 session1Paris2012 session1
Paris2012 session1
 
Prévisions trafic aérien
Prévisions trafic aérienPrévisions trafic aérien
Prévisions trafic aérien
 
Ana maria araujo calderon trabajo social redvolucion leidis
Ana maria araujo calderon trabajo social redvolucion leidisAna maria araujo calderon trabajo social redvolucion leidis
Ana maria araujo calderon trabajo social redvolucion leidis
 
CA concepts, principles and practices Kitui
CA concepts, principles and practices   KituiCA concepts, principles and practices   Kitui
CA concepts, principles and practices Kitui
 
Lean Startup for AaltoES Summer of Startups
Lean Startup for AaltoES Summer of StartupsLean Startup for AaltoES Summer of Startups
Lean Startup for AaltoES Summer of Startups
 
Инновации в рекламном бизнесе
Инновации в рекламном бизнесеИнновации в рекламном бизнесе
Инновации в рекламном бизнесе
 
Slide 04 - Otaciso, Digelvânia e Silvana
Slide 04 - Otaciso, Digelvânia e SilvanaSlide 04 - Otaciso, Digelvânia e Silvana
Slide 04 - Otaciso, Digelvânia e Silvana
 
Spring 2011 Current Projects
Spring 2011 Current ProjectsSpring 2011 Current Projects
Spring 2011 Current Projects
 
3 ways to value your business
3 ways to value your business3 ways to value your business
3 ways to value your business
 
JavaScript Unit Testing
JavaScript Unit TestingJavaScript Unit Testing
JavaScript Unit Testing
 

Plus de Cdiscount

Presentation r markdown
Presentation r markdown Presentation r markdown
Presentation r markdown Cdiscount
 
R2DOCX : R + WORD
R2DOCX : R + WORDR2DOCX : R + WORD
R2DOCX : R + WORDCdiscount
 
Fltau r interface
Fltau r interfaceFltau r interface
Fltau r interfaceCdiscount
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2Cdiscount
 
Introduction à la cartographie avec R
Introduction à la cartographie avec RIntroduction à la cartographie avec R
Introduction à la cartographie avec RCdiscount
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Cdiscount
 
Premier pas de web scrapping avec R
Premier pas de  web scrapping avec RPremier pas de  web scrapping avec R
Premier pas de web scrapping avec RCdiscount
 
Incorporer du C dans R, créer son package
Incorporer du C dans R, créer son packageIncorporer du C dans R, créer son package
Incorporer du C dans R, créer son packageCdiscount
 
Comptabilité Nationale avec R
Comptabilité Nationale avec RComptabilité Nationale avec R
Comptabilité Nationale avec RCdiscount
 
Cartographie avec igraph sous R (Partie 2)
Cartographie avec igraph sous R (Partie 2)Cartographie avec igraph sous R (Partie 2)
Cartographie avec igraph sous R (Partie 2)Cdiscount
 
Cartographie avec igraph sous R (Partie 1)
Cartographie avec igraph sous R (Partie 1) Cartographie avec igraph sous R (Partie 1)
Cartographie avec igraph sous R (Partie 1) Cdiscount
 
RStudio is good for you
RStudio is good for youRStudio is good for you
RStudio is good for youCdiscount
 
R fait du la tex
R fait du la texR fait du la tex
R fait du la texCdiscount
 
Première approche de cartographie sous R
Première approche de cartographie sous RPremière approche de cartographie sous R
Première approche de cartographie sous RCdiscount
 

Plus de Cdiscount (17)

R Devtools
R DevtoolsR Devtools
R Devtools
 
Presentation r markdown
Presentation r markdown Presentation r markdown
Presentation r markdown
 
R2DOCX : R + WORD
R2DOCX : R + WORDR2DOCX : R + WORD
R2DOCX : R + WORD
 
Gur1009
Gur1009Gur1009
Gur1009
 
Fltau r interface
Fltau r interfaceFltau r interface
Fltau r interface
 
Dataiku r users group v2
Dataiku   r users group v2Dataiku   r users group v2
Dataiku r users group v2
 
Introduction à la cartographie avec R
Introduction à la cartographie avec RIntroduction à la cartographie avec R
Introduction à la cartographie avec R
 
HADOOP + R
HADOOP + RHADOOP + R
HADOOP + R
 
Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)Parallel R in snow (english after 2nd slide)
Parallel R in snow (english after 2nd slide)
 
Premier pas de web scrapping avec R
Premier pas de  web scrapping avec RPremier pas de  web scrapping avec R
Premier pas de web scrapping avec R
 
Incorporer du C dans R, créer son package
Incorporer du C dans R, créer son packageIncorporer du C dans R, créer son package
Incorporer du C dans R, créer son package
 
Comptabilité Nationale avec R
Comptabilité Nationale avec RComptabilité Nationale avec R
Comptabilité Nationale avec R
 
Cartographie avec igraph sous R (Partie 2)
Cartographie avec igraph sous R (Partie 2)Cartographie avec igraph sous R (Partie 2)
Cartographie avec igraph sous R (Partie 2)
 
Cartographie avec igraph sous R (Partie 1)
Cartographie avec igraph sous R (Partie 1) Cartographie avec igraph sous R (Partie 1)
Cartographie avec igraph sous R (Partie 1)
 
RStudio is good for you
RStudio is good for youRStudio is good for you
RStudio is good for you
 
R fait du la tex
R fait du la texR fait du la tex
R fait du la tex
 
Première approche de cartographie sous R
Première approche de cartographie sous RPremière approche de cartographie sous R
Première approche de cartographie sous R
 

Dernier

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Dernier (20)

08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Présentation G.Biau Random Forests

  • 1. Random forests Gérard Biau Groupe de travail prévision, May 2011 G. Biau (UPMC) 1 / 106
  • 2. Outline 1 Setting 2 A random forests model 3 Layered nearest neighbors and random forests G. Biau (UPMC) 2 / 106
  • 3. Outline 1 Setting 2 A random forests model 3 Layered nearest neighbors and random forests G. Biau (UPMC) 3 / 106
  • 4. Leo Breiman (1928-2005) G. Biau (UPMC) 4 / 106
  • 5. Leo Breiman (1928-2005) G. Biau (UPMC) 5 / 106
  • 6. Classification G. Biau (UPMC) 6 / 106
  • 7. Classification ? G. Biau (UPMC) 7 / 106
  • 8. Classification G. Biau (UPMC) 8 / 106
  • 9. Regression −2.33 1.38 6.76 5.89 12.00 0.56 2.27 0.12 4.73 31.2 −2.88 −2.56 7.72 7.26 0.12 4.32 2.23 7.77 −1.23 3.82 1.09 8.21 −2.66 G. Biau (UPMC) 9 / 106
  • 10. Regression −2.33 1.38 6.76 5.89 12.00 0.56 2.27 0.12 4.73 31.2 −2.88 7.72 ? −2.56 7.26 0.12 4.32 2.23 7.77 −1.23 3.82 1.09 8.21 −2.66 G. Biau (UPMC) 10 / 106
  • 11. Regression −2.33 1.38 6.76 5.89 12.00 0.56 2.27 0.12 4.73 31.2 −2.88 2.25 −2.56 7.72 7.26 0.12 4.32 2.23 7.77 −1.23 3.82 1.09 8.21 −2.66 G. Biau (UPMC) 11 / 106
  • 12. Trees ? G. Biau (UPMC) 12 / 106
  • 13. Trees ? G. Biau (UPMC) 13 / 106
  • 14. Trees ? G. Biau (UPMC) 14 / 106
  • 15. Trees ? G. Biau (UPMC) 15 / 106
  • 16. Trees ? G. Biau (UPMC) 16 / 106
  • 17. Trees ? G. Biau (UPMC) 17 / 106
  • 18. Trees ? G. Biau (UPMC) 18 / 106
  • 19. Trees G. Biau (UPMC) 19 / 106
  • 20. Trees G. Biau (UPMC) 20 / 106
  • 21. Trees G. Biau (UPMC) 21 / 106
  • 22. Trees G. Biau (UPMC) 22 / 106
  • 23. Trees G. Biau (UPMC) 23 / 106
  • 24. Trees G. Biau (UPMC) 24 / 106
  • 25. Trees G. Biau (UPMC) 25 / 106
  • 26. Trees G. Biau (UPMC) 26 / 106
  • 27. Trees ? G. Biau (UPMC) 27 / 106
  • 28. Trees ? G. Biau (UPMC) 28 / 106
  • 29. Trees ? G. Biau (UPMC) 29 / 106
  • 30. Trees ? G. Biau (UPMC) 30 / 106
  • 31. Trees ? G. Biau (UPMC) 31 / 106
  • 32. Trees ? G. Biau (UPMC) 32 / 106
  • 33. Trees ? G. Biau (UPMC) 33 / 106
  • 34. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman’s ideas were decisively influenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
  • 35. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman’s ideas were decisively influenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
  • 36. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman’s ideas were decisively influenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
  • 37. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman’s ideas were decisively influenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
  • 38. From trees to forests Leo Breiman promoted random forests. Idea: Using tree averaging as a means of obtaining good rules. The base trees are simple and randomized. Breiman’s ideas were decisively influenced by Amit and Geman (1997, geometric feature selection). Ho (1998, random subspace method). Dietterich (2000, random split selection approach). stat.berkeley.edu/users/breiman/RandomForests G. Biau (UPMC) 34 / 106
  • 39. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
  • 40. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
  • 41. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
  • 42. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
  • 43. Random forests They have emerged as serious competitors to state of the art methods. They are fast and easy to implement, produce highly accurate predictions and can handle a very large number of input variables without overfitting. In fact, forests are among the most accurate general-purpose learners available. The algorithm is difficult to analyze and its mathematical properties remain to date largely unknown. Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure. G. Biau (UPMC) 35 / 106
  • 44. Key-references Breiman (2000, 2001, 2004). Definition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
  • 45. Key-references Breiman (2000, 2001, 2004). Definition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
  • 46. Key-references Breiman (2000, 2001, 2004). Definition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
  • 47. Key-references Breiman (2000, 2001, 2004). Definition, experiments and intuitions. Lin and Jeon (2006). Link with layered nearest neighbors. Biau, Devroye and Lugosi (2008). Consistency results for stylized versions. Biau (2010). Sparsity and random forests. G. Biau (UPMC) 36 / 106
  • 48. Outline 1 Setting 2 A random forests model 3 Layered nearest neighbors and random forests G. Biau (UPMC) 37 / 106
  • 49. Three basic ingredients 1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
  • 50. Three basic ingredients 1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
  • 51. Three basic ingredients 1-Randomization and no-pruning For each tree, select at random, at each node, a small group of input coordinates to split. Calculate the best split based on these features and cut. The tree is grown to maximum size, without pruning. G. Biau (UPMC) 38 / 106
  • 52. Three basic ingredients 2-Aggregation Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable. G. Biau (UPMC) 39 / 106
  • 53. Three basic ingredients 2-Aggregation Final predictions are obtained by aggregating over the ensemble. It is fast and easily parallelizable. G. Biau (UPMC) 39 / 106
  • 54. Three basic ingredients 3-Bagging The subspace randomization scheme is blended with bagging. Breiman (1996). Bühlmann and Yu (2002). Biau, Cérou and Guyader (2010). G. Biau (UPMC) 40 / 106
  • 55. Three basic ingredients 3-Bagging The subspace randomization scheme is blended with bagging. Breiman (1996). Bühlmann and Yu (2002). Biau, Cérou and Guyader (2010). G. Biau (UPMC) 40 / 106
  • 56. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For fixed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
  • 57. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For fixed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
  • 58. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For fixed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
  • 59. Mathematical framework A training sample: Dn = {(X1 , Y1 ), . . . , (Xn , Yn )} i.i.d. [0, 1]d × R-valued random variables. A generic pair: (X, Y ) satisfying EY 2 < ∞. Our mission: For fixed x ∈ [0, 1]d , estimate the regression function r (x) = E[Y |X = x] using the data. Quality criterion: E[rn (X) − r (X)]2 . G. Biau (UPMC) 41 / 106
  • 60. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
  • 61. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
  • 62. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
  • 63. The model A random forest is a collection of randomized base regression trees {rn (x, Θm , Dn ), m ≥ 1}. These random trees are combined to form the aggregated regression estimate ¯n (X, Dn ) = EΘ [rn (X, Θ, Dn )] . r Θ is assumed to be independent of X and the training sample Dn . However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn . G. Biau (UPMC) 42 / 106
  • 64. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
  • 65. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
  • 66. The procedure Fix kn ≥ 2 and repeat the following procedure log2 kn times: 1 At each node, a coordinate of X = (X (1) , . . . , X (d) ) is selected, with the j-th feature having a probability pnj ∈ (0, 1) of being selected. 2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side. Thus n i=1 Yi 1[Xi ∈An (X,Θ)] ¯n (X) = EΘ r n 1En (X,Θ) , i=1 1[Xi ∈An (X,Θ)] where n En (X, Θ) = 1[Xi ∈An (X,Θ)] = 0 . i=1 G. Biau (UPMC) 43 / 106
  • 67. Binary trees G. Biau (UPMC) 44 / 106
  • 68. Binary trees G. Biau (UPMC) 45 / 106
  • 69. Binary trees G. Biau (UPMC) 46 / 106
  • 70. Binary trees G. Biau (UPMC) 47 / 106
  • 71. Binary trees G. Biau (UPMC) 48 / 106
  • 72. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does. G. Biau (UPMC) 49 / 106
  • 73. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does. G. Biau (UPMC) 49 / 106
  • 74. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does. G. Biau (UPMC) 49 / 106
  • 75. General comments Each individual tree has exactly 2 log2 kn (≈ kn ) terminal nodes, and each leaf has Lebesgue measure 2− log2 kn (≈ 1/kn ). If X has uniform distribution on [0, 1]d , there will be on average about n/kn observations per terminal node. The choice kn = n induces a very small number of cases in the final leaves. This scheme is close to what the original random forests algorithm does. G. Biau (UPMC) 49 / 106
  • 76. Consistency Theorem Assume that the distribution of X has support on [0, 1]d . Then the random forests estimate ¯n is consistent whenever pnj log kn → ∞ for r all j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simplified version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
  • 77. Consistency Theorem Assume that the distribution of X has support on [0, 1]d . Then the random forests estimate ¯n is consistent whenever pnj log kn → ∞ for r all j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simplified version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
  • 78. Consistency Theorem Assume that the distribution of X has support on [0, 1]d . Then the random forests estimate ¯n is consistent whenever pnj log kn → ∞ for r all j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simplified version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
  • 79. Consistency Theorem Assume that the distribution of X has support on [0, 1]d . Then the random forests estimate ¯n is consistent whenever pnj log kn → ∞ for r all j = 1, . . . , d and kn /n → 0 as n → ∞. In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn /n → 0. This is however a radically simplified version of the random forests used in practice. A more in-depth analysis is needed. G. Biau (UPMC) 50 / 106
  • 80. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
  • 81. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
  • 82. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
  • 83. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
  • 84. Sparsity There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation. Images wavelet coefficients. High-throughput technologies. Sparse estimation is playing an increasingly important role in the statistics and machine learning communities. Several methods have recently been developed in both fields, which rely upon the notion of sparsity. G. Biau (UPMC) 51 / 106
  • 85. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
  • 86. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
  • 87. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
  • 88. Our vision The regression function r (X) = E[Y |X] depends in fact only on a nonempty subset S (for Strong) of the d features. In other words, letting XS = (Xj : j ∈ S) and S = Card S, we have r (X) = E[Y |XS ]. In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n. As such, the value S characterizes the sparsity of the model: The smaller S, the sparser r . G. Biau (UPMC) 52 / 106
  • 89. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample. Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r r where n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
  • 90. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample. Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r r where n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
  • 91. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample. Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r r where n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
  • 92. Sparsity and random forests Ideally, pnj = 1/S for j ∈ S. To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj ). Such a randomization mechanism may be designed on the basis of a test sample. Action plan E [¯n (X) − r (X)]2 = E [¯n (X) − ˜n (X)]2 + E [˜n (X) − r (X)]2 , r r r r where n ˜n (X) = r EΘ [Wni (X, Θ)] r (Xi ). i=1 G. Biau (UPMC) 53 / 106
  • 93. Variance Proposition Assume that X is uniformly distributed on [0, 1]d and, for all x ∈ Rd , σ 2 (x) = V[Y | X = x] ≤ σ 2 for some positive constant σ 2 . Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, S/2d S2 kn E [¯n (X) − ˜n (X)]2 ≤ Cσ 2 r r (1 + ξn ) , S−1 n(log kn )S/2d where S/2d 288 π log 2 C= . π 16 G. Biau (UPMC) 54 / 106
  • 94. Bias Proposition Assume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
  • 95. Bias Proposition Assume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
  • 96. Bias Proposition Assume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
  • 97. Bias Proposition Assume that X is uniformly distributed on [0, 1]d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj ) for j ∈ S, 2SL2 E [˜n (X) − r (X)]2 ≤ r 0.75 + sup r 2 (x) e−n/2kn . (1+γn ) kn S log 2 x∈[0,1]d The rate at which the bias decreases to 0 depends on the number of strong variables, not on d. kn −(0.75/(S log 2))(1+γn ) = o(kn −2/d ) as soon as S ≤ 0.54d . The term e−n/2kn prevents the extreme choice kn = n. G. Biau (UPMC) 55 / 106
• 98. Main result
Theorem. If $p_{nj} = (1/S)(1 + \xi_{nj})$ for $j \in \mathcal{S}$, with $\xi_{nj}\log n \to 0$ as $n \to \infty$, then for the choice
$$k_n \propto \left(\frac{L^2}{\Xi}\right)^{1/\left(1+\frac{0.75}{S\log 2}\right)} n^{1/\left(1+\frac{0.75}{S\log 2}\right)},$$
we have
$$\limsup_{n\to\infty}\ \sup_{(X,Y)\in\mathcal{F}}\ \frac{\mathbb{E}\bigl[\bar r_n(X) - r(X)\bigr]^2}{\Xi\, L^{\frac{2S\log 2}{S\log 2 + 0.75}}\, n^{-\frac{0.75}{S\log 2 + 0.75}}} \le \Lambda.$$
Take-home message. The rate $n^{-\frac{0.75}{S\log 2 + 0.75}}$ is strictly faster than the usual minimax rate $n^{-2/(d+2)}$ as soon as $S \le 0.54\, d$.
G. Biau (UPMC) 56 / 106
• 100. Dimension reduction
[Figure: curve plotted against the number of strong variables S (from 2 to 100); vertical axis from 0 to 0.4; the value 0.0196 is highlighted.]
G. Biau (UPMC) 57 / 106
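A quick way to connect the highlighted value 0.0196 with the take-home message above: presumably the plot compares the forest rate exponent $0.75/(S\log 2 + 0.75)$ with the minimax exponent $2/(d+2)$, and $2/(d+2) \approx 0.0196$ precisely when $d = 100$, which matches the range of $S$ on the horizontal axis. A minimal sketch of that comparison (the choice $d = 100$ is an assumption on my part):

```python
import math

def forest_exponent(S):
    """Exponent of n in the forest rate n^{-0.75/(S log 2 + 0.75)}."""
    return 0.75 / (S * math.log(2) + 0.75)

def minimax_exponent(d):
    """Exponent of n in the usual minimax rate n^{-2/(d+2)}."""
    return 2 / (d + 2)

d = 100                                   # assumed ambient dimension
print(minimax_exponent(d))                # ~0.0196, presumably the highlighted value
for S in (2, 10, 54, 55, 100):
    print(S, forest_exponent(S), forest_exponent(S) > minimax_exponent(d))
# The forest rate wins exactly while S <= 0.54 d (here, S <= 54).
```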
• 101. Discussion — Bagging
The optimal parameter $k_n$ depends on the unknown distribution of $(X, Y)$. To correct this situation, adaptive (data-dependent) choices of $k_n$ should preserve the rate of convergence of the estimate.
Another route we may follow is to analyze the effect of bagging (Biau, Cérou and Guyader, 2010).
G. Biau (UPMC) 58 / 106
• 116. Discussion — Choosing the $p_{nj}$'s
Imaginary scenario. The following splitting scheme is iteratively repeated at each node:
1 Select at random $M_n$ candidate coordinates to split on.
2 If the selection is all weak, then choose one at random to split on.
3 If there is more than one strong variable elected, choose one at random and cut.
G. Biau (UPMC) 70 / 106
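A minimal sketch of this toy coordinate-selection step, assuming the strong set $\mathcal{S}$ is known (function and variable names are mine, candidates drawn with replacement, for illustration only):

```python
import random

def pick_split_coordinate(d, strong, M_n, rng=random):
    """One step of the 'imaginary scenario': draw M_n candidate coordinates, then
    split on a random strong candidate if any was drawn, else on a random weak one."""
    candidates = [rng.randrange(d) for _ in range(M_n)]   # selection with replacement
    strong_hits = [j for j in candidates if j in strong]
    if strong_hits:
        return rng.choice(strong_hits)    # at least one strong variable elected
    return rng.choice(candidates)         # all-weak selection: pick one at random

# Example: d = 50 coordinates, strong set S = {0, 1, 2, 3, 4}, M_n = 10 candidates.
strong = {0, 1, 2, 3, 4}
counts = [0] * 50
for _ in range(100_000):
    counts[pick_split_coordinate(50, strong, 10)] += 1
print([c / 100_000 for c in counts[:5]])  # each strong coordinate is cut with prob ~ p_n
```

By symmetry, each strong coordinate is selected with the probability $p_n$ given on the next slide.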
• 120. Discussion — Choosing the $p_{nj}$'s
Each coordinate in $\mathcal{S}$ will be cut with the "ideal" probability
$$p_n = \frac{1}{S}\left[1 - \left(1 - \frac{S}{d}\right)^{M_n}\right].$$
The parameter $M_n$ should satisfy
$$\left(1 - \frac{S}{d}\right)^{M_n} \log n \to 0 \quad \text{as } n \to \infty.$$
This is true as soon as
$$M_n \to \infty \quad \text{and} \quad \frac{M_n}{\log n} \to \infty \quad \text{as } n \to \infty.$$
G. Biau (UPMC) 71 / 106
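A quick numerical check, under assumed illustrative values, that $p_n$ indeed approaches the ideal $1/S$ and that the correction $(1-S/d)^{M_n}\log n$ becomes negligible once $M_n$ grows faster than $\log n$:

```python
import math

S, d = 5, 50                                        # illustrative values only

for n in (10**3, 10**4, 10**5, 10**6):
    M_n = int(math.log(n) ** 2)                     # one choice with M_n/log n -> infinity
    p_n = (1 - (1 - S / d) ** M_n) / S              # probability of cutting a given strong coordinate
    correction = (1 - S / d) ** M_n * math.log(n)   # must tend to 0 for the theory to apply
    print(n, M_n, round(p_n, 6), correction)
print("ideal 1/S =", 1 / S)
```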
• 123. Assumptions
We have at hand an independent test set $\mathcal{D}_n$.
The model is linear:
$$Y = \sum_{j\in\mathcal{S}} a_j X^{(j)} + \varepsilon.$$
For a fixed node $A = \prod_{j=1}^d A_j$, fix a coordinate $j$ and look at the weighted conditional variance
$$\mathbb{V}\bigl[Y \mid X^{(j)} \in A_j\bigr]\, \mathbb{P}\bigl(X^{(j)} \in A_j\bigr).$$
If $j \in \mathcal{S}$, then the best split is at the midpoint of the node, with a variance decrease equal to $a_j^2/16 > 0$.
If $j \in \mathcal{W}$, the decrease of the variance is always 0, whatever the location of the split.
G. Biau (UPMC) 72 / 106
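Where the $a_j^2/16$ comes from: a short derivation sketch on the root node $[0,1]^d$ (so $A_j = [0,1]$), with $X$ uniform and independent coordinates, so that splitting on $j$ only affects the $a_j X^{(j)}$ contribution:

```latex
% With X^{(j)} uniform on [0,1]: V[X^{(j)}] = 1/12, and after splitting at s,
% V[X^{(j)} | X^{(j)} in [0,s]] = s^2/12 (weight s),
% V[X^{(j)} | X^{(j)} in [s,1]] = (1-s)^2/12 (weight 1-s).
% The other coordinates and the noise contribute the same variance before and after,
% so the decrease of the weighted conditional variance is
\Delta(j,s)
  = a_j^2\left[\frac{1}{12} - \frac{s^3 + (1-s)^3}{12}\right]
  = a_j^2\,\frac{3s(1-s)}{12}
  = \frac{a_j^2\, s(1-s)}{4},
% maximized at the midpoint s = 1/2, where \Delta(j,1/2) = a_j^2/16.
% For j in W (a_j absent from the model), the decrease is 0 for every s.
```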
• 129. Discussion — Choosing the $p_{nj}$'s
Near-reality scenario. The following splitting scheme is iteratively repeated at each node:
1 Select at random $M_n$ candidate coordinates to split on.
2 For each of the $M_n$ elected coordinates, calculate the best split.
3 Select the coordinate which outputs the best within-node sum-of-squares decrease, and cut.
Conclusion. For $j \in \mathcal{S}$,
$$p_{nj} \approx \frac{1}{S}\bigl(1 + \xi_{nj}\bigr),$$
where $\xi_{nj} \to 0$ and satisfies the constraint $\xi_{nj}\log n \to 0$ as $n$ tends to infinity, provided $k_n \log n / n \to 0$, $M_n \to \infty$ and $M_n/\log n \to \infty$.
G. Biau (UPMC) 73 / 106
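A minimal sketch of one such data-driven split step (select $M_n$ candidate coordinates, compute for each the split maximizing the within-node sum-of-squares decrease, keep the best). This is an illustrative implementation under my own naming, not the exact procedure analyzed on the slides:

```python
import random
import numpy as np

def best_split_on_coord(X, Y, j):
    """Best threshold on coordinate j by within-node sum-of-squares decrease."""
    order = np.argsort(X[:, j])
    xs, ys = X[order, j], Y[order]
    total_sse = np.sum((ys - ys.mean()) ** 2)
    best = (0.0, None)
    for cut in range(1, len(ys)):           # split between positions cut-1 and cut
        left, right = ys[:cut], ys[cut:]
        sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
        decrease = total_sse - sse
        if decrease > best[0]:
            best = (decrease, 0.5 * (xs[cut - 1] + xs[cut]))
    return best                              # (decrease, threshold)

def choose_split(X, Y, M_n, rng=random):
    """Near-reality scenario: M_n random candidate coordinates, keep the best one."""
    d = X.shape[1]
    candidates = rng.sample(range(d), min(M_n, d))
    scored = [(best_split_on_coord(X, Y, j), j) for j in candidates]
    (decrease, threshold), j = max(scored, key=lambda t: t[0][0])
    return j, threshold, decrease

# Toy example: Y depends on the first 2 of 10 coordinates.
rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))
Y = 4 * X[:, 0] + 2 * X[:, 1] + 0.1 * rng.normal(size=500)
print(choose_split(X, Y, M_n=5))
```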
• 134. Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
G. Biau (UPMC) 74 / 106
• 159. Layered Nearest Neighbors
Definition. Let $X_1, \dots, X_n$ be a sample of i.i.d. random vectors in $\mathbb{R}^d$, $d \ge 2$. An observation $X_i$ is said to be an LNN of a point $x$ if the hyperrectangle defined by $x$ and $X_i$ contains no other data points.
[Figure: a point $x$ and an empty hyperrectangle illustrating the definition.]
G. Biau (UPMC) 99 / 106
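A minimal sketch of how the LNNs of a query point can be identified (brute force, $O(n^2)$ comparisons; names are mine). For uniform data it also prints the count next to $2^d(\log n)^{d-1}/(d-1)!$, the order of magnitude recalled on the next slide:

```python
import math
import numpy as np

def layered_nn(x, X):
    """Indices i such that X[i] is an LNN of x: the hyperrectangle defined by
    x and X[i] contains no other data point."""
    lo, hi = np.minimum(x, X), np.maximum(x, X)          # corners of each rectangle
    lnn = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False                                # ignore X[i] itself
        if not inside.any():
            lnn.append(i)
    return lnn

# Uniform data on [0,1]^d: L_n(x) should be of the order 2^d (log n)^{d-1} / (d-1)!.
rng = np.random.default_rng(0)
d, n = 2, 5000
x = np.full(d, 0.5)
X = rng.uniform(size=(n, d))
print(len(layered_nn(x, X)), 2**d * math.log(n) ** (d - 1) / math.factorial(d - 1))
```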
• 160. What is known about $L_n(x)$?
... a lot when $X_1, \dots, X_n$ are uniformly distributed over $[0,1]^d$. For example,
$$\mathbb{E} L_n(x) = \frac{2^d (\log n)^{d-1}}{(d-1)!} + O\bigl((\log n)^{d-2}\bigr)$$
and
$$\frac{(d-1)!}{2^d (\log n)^{d-1}}\, L_n(x) \to 1 \quad \text{in probability as } n \to \infty.$$
This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
G. Biau (UPMC) 100 / 106
• 163. Two results (Biau and Devroye, 2010)
Model. $X_1, \dots, X_n$ are independently distributed according to some probability density $f$ (with probability measure $\mu$).
Theorem. For $\mu$-almost all $x \in \mathbb{R}^d$, one has $L_n(x) \to \infty$ in probability as $n \to \infty$.
Theorem. Suppose that $f$ is $\lambda$-almost everywhere continuous. Then
$$\frac{(d-1)!}{2^d (\log n)^{d-1}}\, \mathbb{E} L_n(x) \to 1 \quad \text{as } n \to \infty,$$
at $\mu$-almost all $x \in \mathbb{R}^d$.
G. Biau (UPMC) 101 / 106
• 166. LNN regression estimation
Model. $(X, Y), (X_1, Y_1), \dots, (X_n, Y_n)$ are i.i.d. random vectors of $\mathbb{R}^d \times \mathbb{R}$. Moreover, $|Y|$ is bounded and $X$ has a density.
The regression function $r(x) = \mathbb{E}[Y \mid X = x]$ may be estimated by
$$r_n(x) = \frac{1}{L_n(x)} \sum_{i=1}^n Y_i\, \mathbf{1}_{[X_i \in \mathcal{L}_n(x)]},$$
where $\mathcal{L}_n(x)$ denotes the set of LNNs of $x$.
1 No smoothing parameter.
2 A scale-invariant estimate.
3 Intimately connected to Breiman's random forests.
G. Biau (UPMC) 102 / 106
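A minimal sketch of the estimate itself, reusing the same brute-force LNN search as in the earlier sketch (names are mine; the toy regression function is only for illustration):

```python
import numpy as np

def layered_nn(x, X):
    """Indices of the LNNs of x (same brute-force helper as in the earlier sketch)."""
    lo, hi = np.minimum(x, X), np.maximum(x, X)
    lnn = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False
        if not inside.any():
            lnn.append(i)
    return lnn

def lnn_regression(x, X, Y):
    """r_n(x): average of the responses over the LNNs of x.
    No smoothing parameter; unchanged under coordinatewise monotone rescaling."""
    idx = layered_nn(x, X)
    return float(np.mean(Y[idx]))

# Toy example: r(x) = x1 + x2 on [0,1]^2.
rng = np.random.default_rng(1)
X = rng.uniform(size=(3000, 2))
Y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=3000)
x = np.array([0.3, 0.6])
print(lnn_regression(x, X, Y), 0.3 + 0.6)   # estimate vs true r(x)
```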
• 171. Consistency
Theorem (Pointwise $L^p$-consistency). Assume that the regression function $r$ is $\lambda$-almost everywhere continuous and that $Y$ is bounded. Then, for $\mu$-almost all $x \in \mathbb{R}^d$ and all $p \ge 1$,
$$\mathbb{E}\bigl|r_n(x) - r(x)\bigr|^p \to 0 \quad \text{as } n \to \infty.$$
Theorem (Global $L^p$-consistency). Under the same conditions, for all $p \ge 1$,
$$\mathbb{E}\bigl|r_n(X) - r(X)\bigr|^p \to 0 \quad \text{as } n \to \infty.$$
1 No universal consistency result with respect to $r$ is possible.
2 The results do not impose any condition on the density.
3 They are also scale-free.
G. Biau (UPMC) 103 / 106
• 176. Back to random forests
A random forest can be viewed as a weighted LNN regression estimate
$$\bar r_n(x) = \sum_{i=1}^n Y_i W_{ni}(x),$$
where the weights concentrate on the LNN and satisfy $\sum_{i=1}^n W_{ni}(x) = 1$.
G. Biau (UPMC) 104 / 106
• 177. Non-adaptive strategies
Consider the non-adaptive random forests estimate
$$\bar r_n(x) = \sum_{i=1}^n Y_i W_{ni}(x),$$
where the weights concentrate on the LNN.
Proposition. For any $x \in \mathbb{R}^d$, assume that $\sigma^2 = \mathbb{V}[Y \mid X = x]$ is independent of $x$. Then
$$\mathbb{E}\bigl[\bar r_n(x) - r(x)\bigr]^2 \ge \frac{\sigma^2}{\mathbb{E} L_n(x)}.$$
G. Biau (UPMC) 105 / 106
• 179. Rate of convergence
At $\mu$-almost all $x$, when $f$ is $\lambda$-almost everywhere continuous,
$$\mathbb{E}\bigl[\bar r_n(x) - r(x)\bigr]^2 \gtrsim \frac{\sigma^2 (d-1)!}{2^d (\log n)^{d-1}}.$$
Improving the rate of convergence:
1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than $k_n$ points.
2 Resort to bagging and randomize using random subsamples.
G. Biau (UPMC) 106 / 106
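As an illustration of the second idea, here is a minimal sketch of a subsampled-and-aggregated version of the LNN estimate: the LNN average is computed on B random subsamples of size $a_n < n$ and the results are averaged. This is only a schematic reading of the suggestion above, with my own names and arbitrary parameter values:

```python
import numpy as np

def layered_nn(x, X):
    """Brute-force LNN search, as in the earlier sketches."""
    lo, hi = np.minimum(x, X), np.maximum(x, X)
    lnn = []
    for i in range(len(X)):
        inside = np.all((X >= lo[i]) & (X <= hi[i]), axis=1)
        inside[i] = False
        if not inside.any():
            lnn.append(i)
    return lnn

def subagged_lnn(x, X, Y, B=50, a_n=None, rng=None):
    """Average of LNN estimates computed on B random subsamples of size a_n < n."""
    rng = np.random.default_rng() if rng is None else rng
    n = len(X)
    a_n = int(n ** 0.8) if a_n is None else a_n      # arbitrary subsample size
    estimates = []
    for _ in range(B):
        idx = rng.choice(n, size=a_n, replace=False)
        lnn = layered_nn(x, X[idx])
        estimates.append(Y[idx][lnn].mean())
    return float(np.mean(estimates))

# Same toy data as before: r(x) = x1 + x2 on [0,1]^2.
rng = np.random.default_rng(2)
X = rng.uniform(size=(2000, 2))
Y = X[:, 0] + X[:, 1] + 0.1 * rng.normal(size=2000)
print(subagged_lnn(np.array([0.3, 0.6]), X, Y, rng=rng))   # compare with r(x) = 0.9
```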