From trees to forests
Leo Breiman promoted random forests.
Idea: use tree averaging as a means of obtaining good rules.
The base trees are simple and randomized.
Breiman's ideas were decisively influenced by
Amit and Geman (1997, geometric feature selection),
Ho (1998, the random subspace method), and
Dietterich (2000, random split selection).
stat.berkeley.edu/users/breiman/RandomForests
Random forests
Random forests have emerged as serious competitors to state-of-the-art methods.
They are fast and easy to implement, produce highly accurate predictions, and can handle a very large number of input variables without overfitting.
In fact, forests are among the most accurate general-purpose learners available.
Yet the algorithm is difficult to analyze, and its mathematical properties remain to date largely unknown.
Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.
Key references
Breiman (2000, 2001, 2004): definition, experiments and intuitions.
Lin and Jeon (2006): link with layered nearest neighbors.
Biau, Devroye and Lugosi (2008): consistency results for stylized versions.
Biau (2010): sparsity and random forests.
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Three basic ingredients
1. Randomization and no pruning
For each tree, select at random, at each node, a small group of input coordinates to split on.
Compute the best split based on these features, and cut.
Each tree is grown to maximal size, without pruning.
Three basic ingredients
2. Aggregation
Final predictions are obtained by aggregating (averaging) over the ensemble.
The procedure is fast and easily parallelizable.
Three basic ingredients
3. Bagging
The subspace randomization scheme is blended with bagging: each tree is fit on a bootstrap sample of the data.
Breiman (1996).
Bühlmann and Yu (2002).
Biau, Cérou and Guyader (2010).
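A minimal illustration of how these three ingredients fit together, assuming scikit-learn is available (this snippet is mine, not part of the original slides): max_features controls the small random group of candidate coordinates at each node, max_depth=None leaves the trees unpruned, bootstrap=True adds the bagging step, and prediction averages over the ensemble.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))                      # 500 points in [0,1]^10
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)

forest = RandomForestRegressor(
    n_estimators=200,       # number of randomized base trees
    max_features="sqrt",    # small random group of coordinates tried at each node
    max_depth=None,         # trees grown to maximal size, no pruning
    bootstrap=True,         # bagging: each tree is fit on a bootstrap sample
    random_state=0,
    n_jobs=-1,              # tree construction is embarrassingly parallel
).fit(X, y)

print(forest.predict(X[:5]))   # aggregation: average of the 200 tree predictions
```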
Mathematical framework
A training sample: Dn = {(X1, Y1), ..., (Xn, Yn)}, i.i.d. [0,1]^d × R-valued random variables.
A generic pair: (X, Y) satisfying EY² < ∞.
Our mission: for fixed x ∈ [0,1]^d, estimate the regression function r(x) = E[Y | X = x] using the data.
Quality criterion: E[rn(X) − r(X)]².
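On synthetic data the quality criterion is usually approximated by Monte Carlo. A small sketch of that computation (my own illustration; the toy r and the placeholder estimator are assumptions, not from the slides):
```python
import numpy as np

rng = np.random.default_rng(0)

def r(x):                        # toy regression function r(x) = E[Y | X = x]
    return np.sin(2 * np.pi * x[:, 0])

def r_n(x):                      # placeholder estimator; plug in any fitted model
    return np.zeros(len(x))

X = rng.uniform(size=(100_000, 5))        # fresh draws of X, uniform on [0,1]^5
risk = np.mean((r_n(X) - r(X)) ** 2)      # Monte Carlo estimate of E[r_n(X) - r(X)]^2
print(risk)                               # here = E[sin^2(2*pi*X1)] ≈ 0.5
```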
The model
A random forest is a collection of randomized base regression trees {rn(x, Θm, Dn), m ≥ 1}.
These random trees are combined to form the aggregated regression estimate
\[
\bar r_n(X, D_n) = \mathbb E_\Theta\big[ r_n(X, \Theta, D_n) \big].
\]
Θ is assumed to be independent of X and of the training sample Dn.
However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn.
The procedure
Fix kn ≥ 2 and repeat the following procedure ⌈log2 kn⌉ times:
1 At each node, a coordinate of X = (X^(1), ..., X^(d)) is selected, the j-th feature having probability pnj ∈ (0, 1) of being selected.
2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side.
Thus
\[
\bar r_n(X) = \mathbb E_\Theta\!\left[ \frac{\sum_{i=1}^n Y_i \,\mathbf 1_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf 1_{[X_i \in A_n(X,\Theta)]}} \,\mathbf 1_{\mathcal E_n(X,\Theta)} \right],
\]
where An(X, Θ) is the cell of the partition containing X and
\[
\mathcal E_n(X,\Theta) = \Big[ \textstyle\sum_{i=1}^n \mathbf 1_{[X_i \in A_n(X,\Theta)]} \neq 0 \Big]
\]
is the event that this cell contains at least one data point.
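A from-scratch sketch of this procedure (my own reading of the slides, not the author's code): each draw of Θ produces one tree by repeating ⌈log2 kn⌉ rounds of coordinate selection (coordinate j with probability pnj) followed by a midpoint split, and the forest averages the tree estimates over independent draws of Θ, returning 0 on the empty-cell event.
```python
import math
import numpy as np

def tree_estimate(x, X, Y, k_n, p, rng):
    """One base tree: follow x down ceil(log2(k_n)) midpoint splits."""
    d = X.shape[1]
    lower, upper = np.zeros(d), np.ones(d)         # current cell A_n(x, Theta)
    in_cell = np.ones(len(X), dtype=bool)          # data points still in the cell
    for _ in range(math.ceil(math.log2(k_n))):
        j = rng.choice(d, p=p)                     # coordinate j drawn with prob. p[j]
        mid = (lower[j] + upper[j]) / 2.0          # midpoint of the chosen side
        if x[j] <= mid:
            upper[j] = mid
            in_cell &= X[:, j] <= mid
        else:
            lower[j] = mid
            in_cell &= X[:, j] > mid
    return Y[in_cell].mean() if in_cell.any() else 0.0   # 0 on the empty-cell event

def forest_estimate(x, X, Y, k_n, p, n_trees=200, seed=0):
    """Approximate E_Theta[...] by averaging over n_trees independent trees."""
    rng = np.random.default_rng(seed)
    return float(np.mean([tree_estimate(x, X, Y, k_n, p, rng) for _ in range(n_trees)]))
```
With p = np.full(d, 1/d) this is the purely random forest mentioned below; concentrating p on the strong coordinates mimics the sparse regime of the later slides.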
General comments
Each individual tree has exactly 2^⌈log2 kn⌉ (≈ kn) terminal nodes, and each leaf has Lebesgue measure 2^−⌈log2 kn⌉ (≈ 1/kn).
If X is uniformly distributed on [0,1]^d, there will be on average about n/kn observations per terminal node.
The choice kn = n induces a very small number of cases in the final leaves.
This scheme is close to what the original random forests algorithm does.
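To fix orders of magnitude (simple arithmetic, not from the slides): with kn = 100, each tree performs ⌈log2 100⌉ = 7 rounds of splits, hence has 2^7 = 128 leaves of Lebesgue measure 1/128 each; with n = 10 000 uniform observations, a leaf then contains about 10 000/128 ≈ 78 points on average, i.e. of order n/kn.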
Consistency
Theorem
Assume that the distribution of X has support on [0,1]^d. Then the random forests estimate r̄n is consistent whenever pnj log kn → ∞ for all j = 1, ..., d and kn/n → 0 as n → ∞.
In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn/n → 0.
This is, however, a radically simplified version of the random forests used in practice.
A more in-depth analysis is needed.
Sparsity
There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation.
Wavelet coefficients of images.
High-throughput technologies.
Sparse estimation is playing an increasingly important role in the statistics and machine learning communities.
Several methods relying on the notion of sparsity have recently been developed in both fields.
Our vision
The regression function r(X) = E[Y | X] depends in fact only on a nonempty subset S (for Strong) of the d features.
In other words, letting XS = (X^(j) : j ∈ S) and S = Card S, we have r(X) = E[Y | XS].
In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n.
As such, the value S characterizes the sparsity of the model: the smaller S, the sparser r.
Sparsity and random forests
Ideally, pnj = 1/S for j ∈ S.
To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj).
Such a randomization mechanism may be designed on the basis of a test sample.
Action plan
\[
\mathbb E\,[\bar r_n(X) - r(X)]^2 = \mathbb E\,[\bar r_n(X) - \tilde r_n(X)]^2 + \mathbb E\,[\tilde r_n(X) - r(X)]^2,
\]
where
\[
\tilde r_n(X) = \sum_{i=1}^n \mathbb E_\Theta[W_{ni}(X,\Theta)]\, r(X_i)
\]
and Wni(X, Θ) is the weight that the tree built with Θ gives to observation i, so that r̄n(X) = Σi EΘ[Wni(X, Θ)] Yi. The cross term vanishes because, conditionally on X and X1, ..., Xn, the weights are fixed and E[Yi − r(Xi) | Xi] = 0.
Variance
Proposition
Assume that X is uniformly distributed on [0,1]^d and, for all x ∈ R^d,
σ²(x) = V[Y | X = x] ≤ σ²
for some positive constant σ². Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
\[
\mathbb E\,[\bar r_n(X) - \tilde r_n(X)]^2 \le C\sigma^2 \,\frac{S^2}{S-1}\,(1+\xi_n)\, \frac{k_n}{n(\log k_n)^{S/2d}},
\]
where
\[
C = \frac{288}{\pi}\left(\frac{\pi \log 2}{16}\right)^{S/2d}.
\]
Bias
Proposition
Assume that X is uniformly distributed on [0,1]^d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
\[
\mathbb E\,[\tilde r_n(X) - r(X)]^2 \le \frac{2SL^2}{k_n^{\frac{0.75}{S\log 2}(1+\gamma_n)}} + \Big(\sup_{x\in[0,1]^d} r^2(x)\Big)\, e^{-n/2k_n}.
\]
The rate at which the bias decreases to 0 depends on the number of strong variables, not on d.
kn^{−(0.75/(S log 2))(1+γn)} = o(kn^{−2/d}) as soon as S ≤ 0.54 d.
The term e^{−n/2kn} prevents the extreme choice kn = n.
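Where the 0.54 d threshold comes from (a one-line check using only the exponents above, ignoring the (1 + γn) factor):
\[
k_n^{-0.75/(S\log 2)} = o\big(k_n^{-2/d}\big)
\iff \frac{0.75}{S\log 2} > \frac{2}{d}
\iff S < \frac{0.75\, d}{2\log 2} \approx 0.54\, d .
\]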
Main result
Theorem
If pnj = (1/S)(1 + ξnj) for j ∈ S, with ξnj log n → 0 as n → ∞, then for the choice
\[
k_n \propto \left(\frac{L^2}{\Xi}\right)^{\!1/\big(1+\frac{0.75}{S\log 2}\big)} n^{1/\big(1+\frac{0.75}{S\log 2}\big)},
\]
we have
\[
\limsup_{n\to\infty}\; \sup_{(X,Y)\in\mathcal F}\; \frac{\mathbb E\,[\bar r_n(X) - r(X)]^2}{\Xi^{\frac{0.75}{S\log 2+0.75}}\, L^{\frac{2S\log 2}{S\log 2+0.75}}\, n^{-\frac{0.75}{S\log 2+0.75}}} \le \Lambda.
\]
Take-home message
The rate n^{−0.75/(S log 2 + 0.75)} is strictly faster than the usual minimax rate n^{−2/(d+2)} as soon as S ≤ 0.54 d.
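A numerical illustration of the take-home message (simple arithmetic, not from the slides): with d = 20 ambient features of which S = 5 are strong, the forest rate is n^{−0.75/(5 log 2 + 0.75)} ≈ n^{−0.18}, whereas the minimax rate over the full ambient space is n^{−2/(d+2)} = n^{−2/22} ≈ n^{−0.09}; the exponent depends on S only.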
Discussion — Bagging
The optimal parameter kn depends on the unknown distribution of (X, Y).
To correct this situation, adaptive (data-dependent) choices of kn should preserve the rate of convergence of the estimate.
Another route we may follow is to analyze the effect of bagging.
Biau, Cérou and Guyader (2010).
Discussion — Choosing the pnj's
Imaginary scenario
The following splitting scheme is iteratively repeated at each node:
1 Select at random Mn candidate coordinates to split on.
2 If the selection is all weak, then choose one of them at random to split on.
3 If one or more strong variables are elected, choose one of them at random and cut.
Discussion — Choosing the pnj's
Each coordinate in S will be cut with the "ideal" probability
\[
p_n = \frac{1}{S}\left[1 - \Big(1 - \frac{S}{d}\Big)^{M_n}\right].
\]
The parameter Mn should satisfy
\[
\Big(1 - \frac{S}{d}\Big)^{M_n} \log n \to 0 \quad \text{as } n \to \infty.
\]
This is true as soon as
\[
M_n \to \infty \quad\text{and}\quad \frac{M_n}{\log n} \to \infty \quad \text{as } n \to \infty.
\]
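A tiny numerical check of the "ideal" probability (my own illustration): even a moderate Mn already brings pn close to 1/S.
```python
S, d = 5, 100                    # 5 strong features among d = 100
for M_n in (10, 50, 200):
    p_n = (1 / S) * (1 - (1 - S / d) ** M_n)
    print(M_n, round(p_n, 4))    # -> 0.0803, 0.1846, 0.2 (ideal value 1/S = 0.2)
```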
Assumptions
We have at hand an independent test set D′n, distributed as Dn.
The model is linear:
\[
Y = \sum_{j\in S} a_j X^{(j)} + \varepsilon.
\]
For a fixed node A = ∏_{j=1}^d Aj, fix a coordinate j and look at the weighted conditional variance V[Y | X^(j) ∈ Aj] P(X^(j) ∈ Aj).
If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj²/16 > 0.
If j ∈ W, the set of weak features, the decrease of the variance is always 0, whatever the location of the split.
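A quick check of the aj²/16 figure at the root node Aj = [0,1] (my own computation under the slide's assumptions, with X^(j) uniform and the other terms of the linear model unaffected by a split along coordinate j): since a uniform variable on an interval of length ℓ has variance ℓ²/12, splitting Aj at a point t leaves a weighted within-node variance contribution of aj²[t³ + (1−t)³]/12 for the j-th term. This is minimized at t = 1/2, and the decrease is
\[
a_j^2\,\frac{1}{12} \;-\; a_j^2\,\frac{(1/2)^3 + (1/2)^3}{12} \;=\; a_j^2\left(\frac{1}{12}-\frac{1}{48}\right) \;=\; \frac{a_j^2}{16}.
\]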
Discussion — Choosing the pnj's
Near-reality scenario
The following splitting scheme is iteratively repeated at each node:
1 Select at random Mn candidate coordinates to split on.
2 For each of the Mn elected coordinates, calculate the best split.
3 Select the coordinate which yields the largest decrease of the within-node sum of squares, and cut.
Conclusion
For j ∈ S,
\[
p_{nj} \approx \frac{1}{S}\,(1 + \xi_{nj}),
\]
where ξnj → 0 and satisfies the constraint ξnj log n → 0 as n tends to infinity, provided kn log n/n → 0, Mn → ∞ and Mn/log n → ∞.
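A rough sketch of one node of the near-reality scheme (my own code, not the author's; assumes Mn ≤ d): draw Mn candidate coordinates, compute for each the best empirical decrease in within-node sum of squares over all cut points, and keep the winner.
```python
import numpy as np

def best_split(X, Y, M_n, rng):
    """Return (coordinate, threshold, SSE decrease) for one node."""
    n, d = X.shape
    candidates = rng.choice(d, size=M_n, replace=False)   # M_n candidate coordinates
    total_sse = np.sum((Y - Y.mean()) ** 2)
    best = (None, None, -np.inf)
    for j in candidates:
        order = np.argsort(X[:, j])
        xj, yj = X[order, j], Y[order]
        for cut in range(1, n):                            # split between cut-1 and cut
            left, right = yj[:cut], yj[cut:]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if total_sse - sse > best[2]:
                best = (j, (xj[cut - 1] + xj[cut]) / 2.0, total_sse - sse)
    return best
```
Repeating this at every node reproduces a CART-style split selection; the slides' conclusion is that, under the linear model above, strong coordinates end up being cut with probability close to 1/S.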
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Layered Nearest Neighbors
Definition
Let X1, ..., Xn be a sample of i.i.d. random vectors in R^d, d ≥ 2. An observation Xi is said to be a layered nearest neighbor (LNN) of a point x if the hyperrectangle defined by x and Xi contains no other data point.
[Figure: a point x and an observation Xi spanning an empty hyperrectangle.]
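A direct O(n² d) sketch of this definition (my own code): Xi is kept as an LNN of x if and only if no other sample point falls inside the hyperrectangle spanned by x and Xi. With a continuous distribution, ties on the rectangle boundary have probability zero, so closed inequalities are used for simplicity.
```python
import numpy as np

def layered_nearest_neighbours(x, X):
    """Indices i such that X[i] is a layered nearest neighbour of x."""
    lnn = []
    for i in range(len(X)):
        lo, hi = np.minimum(x, X[i]), np.maximum(x, X[i])   # rectangle spanned by x and X[i]
        inside = np.all((X >= lo) & (X <= hi), axis=1)      # sample points in that rectangle
        inside[i] = False                                   # ignore X[i] itself
        if not inside.any():                                # empty rectangle: X[i] is an LNN
            lnn.append(i)
    return np.array(lnn, dtype=int)
```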
What is known about Ln(x)?
... a lot when X1, ..., Xn are uniformly distributed over [0,1]^d (here Ln(x) denotes the number of LNN of x).
For example,
\[
\mathbb E\, L_n(x) = \frac{2^d(\log n)^{d-1}}{(d-1)!} + O\big((\log n)^{d-2}\big)
\]
and
\[
\frac{(d-1)!\, L_n(x)}{2^d(\log n)^{d-1}} \to 1 \quad \text{in probability as } n \to \infty.
\]
This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
Two results (Biau and Devroye, 2010)
Model
X1, ..., Xn are independently distributed according to some probability density f (with probability measure µ).
Theorem
For µ-almost all x ∈ R^d, one has Ln(x) → ∞ in probability as n → ∞.
Theorem
Suppose that f is λ-almost everywhere continuous. Then
\[
\frac{(d-1)!\,\mathbb E\, L_n(x)}{2^d(\log n)^{d-1}} \to 1 \quad \text{as } n \to \infty,
\]
at µ-almost all x ∈ R^d.
LNN regression estimation
Model
(X, Y), (X1, Y1), ..., (Xn, Yn) are i.i.d. random vectors of R^d × R. Moreover, |Y| is bounded and X has a density.
The regression function r(x) = E[Y | X = x] may be estimated by
\[
r_n(x) = \frac{1}{L_n(x)} \sum_{i=1}^n Y_i\, \mathbf 1_{[X_i \text{ is an LNN of } x]}.
\]
1 No smoothing parameter.
2 A scale-invariant estimate.
3 Intimately connected to Breiman's random forests.
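The estimate above then reads as a one-line average (my own sketch, reusing the hypothetical layered_nearest_neighbours helper from the previous snippet):
```python
import numpy as np

def lnn_regression(x, X, Y):
    """r_n(x): average of the Y_i over the layered nearest neighbours of x."""
    idx = layered_nearest_neighbours(x, X)              # from the previous sketch
    return float(Y[idx].mean()) if len(idx) else 0.0    # guard for the degenerate empty case
```
Nothing has to be tuned, and the set of LNN is unchanged by coordinatewise monotone rescaling of the data, which is what the three bullets above refer to.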
Consistency
Theorem (Pointwise Lp-consistency)
Assume that the regression function r is λ-almost everywhere continuous and that Y is bounded. Then, for µ-almost all x ∈ R^d and all p ≥ 1,
E|rn(x) − r(x)|^p → 0 as n → ∞.
Theorem (Global Lp-consistency)
Under the same conditions, for all p ≥ 1,
E|rn(X) − r(X)|^p → 0 as n → ∞.
1 No universal consistency result with respect to r is possible.
2 The results do not impose any condition on the density.
3 They are also scale-free.
Back to random forests
A random forest can be viewed as a weighted LNN regression estimate
\[
\bar r_n(x) = \sum_{i=1}^n Y_i\, W_{ni}(x),
\]
where the weights concentrate on the LNN and satisfy
\[
\sum_{i=1}^n W_{ni}(x) = 1.
\]
Non-adaptive strategies
Consider the non-adaptive random forests estimate
\[
\bar r_n(x) = \sum_{i=1}^n Y_i\, W_{ni}(x),
\]
where the weights concentrate on the LNN.
Proposition
For any x ∈ R^d, assume that σ² = V[Y | X = x] is independent of x. Then
\[
\mathbb E\,[\bar r_n(x) - r(x)]^2 \ge \frac{\sigma^2}{\mathbb E\, L_n(x)}.
\]
Rate of convergence
At µ-almost all x, when f is λ-almost everywhere continuous,
\[
\mathbb E\,[\bar r_n(x) - r(x)]^2 \gtrsim \frac{\sigma^2 (d-1)!}{2^d(\log n)^{d-1}}.
\]
Improving the rate of convergence
1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than kn points.
2 Resort to bagging and randomize using random subsamples.