From trees to forests
Leo Breiman promoted random forests.
Idea: use tree averaging as a means of obtaining good rules.
The base trees are simple and randomized.
Breiman's ideas were decisively influenced by
Amit and Geman (1997, geometric feature selection),
Ho (1998, the random subspace method), and
Dietterich (2000, random split selection).
stat.berkeley.edu/users/breiman/RandomForests
Random forests
Random forests have emerged as serious competitors to state-of-the-art methods.
They are fast and easy to implement, produce highly accurate predictions, and can handle a very large number of input variables without overfitting.
In fact, forests are among the most accurate general-purpose learners available.
Yet the algorithm is difficult to analyze, and its mathematical properties remain to date largely unknown.
Most theoretical studies have concentrated on isolated parts or stylized versions of the procedure.
Key references
Breiman (2000, 2001, 2004): definition, experiments and intuitions.
Lin and Jeon (2006): link with layered nearest neighbors.
Biau, Devroye and Lugosi (2008): consistency results for stylized versions.
Biau (2010): sparsity and random forests.
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Three basic ingredients
1. Randomization and no pruning
For each tree, select at random, at each node, a small group of input coordinates to split on.
Compute the best split based on these features, and cut.
Each tree is grown to maximal size, without pruning.
Three basic ingredients
2. Aggregation
Final predictions are obtained by aggregating (averaging) over the ensemble.
The procedure is fast and easily parallelizable.
Three basic ingredients
3. Bagging
The subspace randomization scheme is blended with bagging: each tree is fit on a bootstrap sample of the data.
Breiman (1996).
Bühlmann and Yu (2002).
Biau, Cérou and Guyader (2010).
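A minimal illustration of how these three ingredients fit together, assuming scikit-learn is available (this snippet is mine, not part of the original slides): max_features controls the small random group of candidate coordinates at each node, max_depth=None leaves the trees unpruned, bootstrap=True adds the bagging step, and prediction averages over the ensemble.
```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.uniform(size=(500, 10))                      # 500 points in [0,1]^10
y = np.sin(2 * np.pi * X[:, 0]) + 0.1 * rng.standard_normal(500)

forest = RandomForestRegressor(
    n_estimators=200,       # number of randomized base trees
    max_features="sqrt",    # small random group of coordinates tried at each node
    max_depth=None,         # trees grown to maximal size, no pruning
    bootstrap=True,         # bagging: each tree is fit on a bootstrap sample
    random_state=0,
    n_jobs=-1,              # tree construction is embarrassingly parallel
).fit(X, y)

print(forest.predict(X[:5]))   # aggregation: average of the 200 tree predictions
```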
Mathematical framework
A training sample: Dn = {(X1, Y1), ..., (Xn, Yn)}, i.i.d. [0,1]^d × R-valued random variables.
A generic pair: (X, Y) satisfying EY² < ∞.
Our mission: for fixed x ∈ [0,1]^d, estimate the regression function r(x) = E[Y | X = x] using the data.
Quality criterion: E[rn(X) − r(X)]².
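On synthetic data the quality criterion is usually approximated by Monte Carlo. A small sketch of that computation (my own illustration; the toy r and the placeholder estimator are assumptions, not from the slides):
```python
import numpy as np

rng = np.random.default_rng(0)

def r(x):                        # toy regression function r(x) = E[Y | X = x]
    return np.sin(2 * np.pi * x[:, 0])

def r_n(x):                      # placeholder estimator; plug in any fitted model
    return np.zeros(len(x))

X = rng.uniform(size=(100_000, 5))        # fresh draws of X, uniform on [0,1]^5
risk = np.mean((r_n(X) - r(X)) ** 2)      # Monte Carlo estimate of E[r_n(X) - r(X)]^2
print(risk)                               # here = E[sin^2(2*pi*X1)] ≈ 0.5
```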
The model
A random forest is a collection of randomized base regression trees {rn(x, Θm, Dn), m ≥ 1}.
These random trees are combined to form the aggregated regression estimate
\[
\bar r_n(X, D_n) = \mathbb E_\Theta\big[ r_n(X, \Theta, D_n) \big].
\]
Θ is assumed to be independent of X and of the training sample Dn.
However, we allow Θ to be based on a test sample, independent of, but distributed as, Dn.
The procedure
Fix kn ≥ 2 and repeat the following procedure ⌈log2 kn⌉ times:
1 At each node, a coordinate of X = (X^(1), ..., X^(d)) is selected, the j-th feature having probability pnj ∈ (0, 1) of being selected.
2 At each node, once the coordinate is selected, the split is at the midpoint of the chosen side.
Thus
\[
\bar r_n(X) = \mathbb E_\Theta\!\left[ \frac{\sum_{i=1}^n Y_i \,\mathbf 1_{[X_i \in A_n(X,\Theta)]}}{\sum_{i=1}^n \mathbf 1_{[X_i \in A_n(X,\Theta)]}} \,\mathbf 1_{\mathcal E_n(X,\Theta)} \right],
\]
where An(X, Θ) is the cell of the partition containing X and
\[
\mathcal E_n(X,\Theta) = \Big[ \textstyle\sum_{i=1}^n \mathbf 1_{[X_i \in A_n(X,\Theta)]} \neq 0 \Big]
\]
is the event that this cell contains at least one data point.
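A from-scratch sketch of this procedure (my own reading of the slides, not the author's code): each draw of Θ produces one tree by repeating ⌈log2 kn⌉ rounds of coordinate selection (coordinate j with probability pnj) followed by a midpoint split, and the forest averages the tree estimates over independent draws of Θ, returning 0 on the empty-cell event.
```python
import math
import numpy as np

def tree_estimate(x, X, Y, k_n, p, rng):
    """One base tree: follow x down ceil(log2(k_n)) midpoint splits."""
    d = X.shape[1]
    lower, upper = np.zeros(d), np.ones(d)         # current cell A_n(x, Theta)
    in_cell = np.ones(len(X), dtype=bool)          # data points still in the cell
    for _ in range(math.ceil(math.log2(k_n))):
        j = rng.choice(d, p=p)                     # coordinate j drawn with prob. p[j]
        mid = (lower[j] + upper[j]) / 2.0          # midpoint of the chosen side
        if x[j] <= mid:
            upper[j] = mid
            in_cell &= X[:, j] <= mid
        else:
            lower[j] = mid
            in_cell &= X[:, j] > mid
    return Y[in_cell].mean() if in_cell.any() else 0.0   # 0 on the empty-cell event

def forest_estimate(x, X, Y, k_n, p, n_trees=200, seed=0):
    """Approximate E_Theta[...] by averaging over n_trees independent trees."""
    rng = np.random.default_rng(seed)
    return float(np.mean([tree_estimate(x, X, Y, k_n, p, rng) for _ in range(n_trees)]))
```
With p = np.full(d, 1/d) this is the purely random forest mentioned below; concentrating p on the strong coordinates mimics the sparse regime of the later slides.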
General comments
Each individual tree has exactly 2^⌈log2 kn⌉ (≈ kn) terminal nodes, and each leaf has Lebesgue measure 2^−⌈log2 kn⌉ (≈ 1/kn).
If X is uniformly distributed on [0,1]^d, there will be on average about n/kn observations per terminal node.
The choice kn = n induces a very small number of cases in the final leaves.
This scheme is close to what the original random forests algorithm does.
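To fix orders of magnitude (simple arithmetic, not from the slides): with kn = 100, each tree performs ⌈log2 100⌉ = 7 rounds of splits, hence has 2^7 = 128 leaves of Lebesgue measure 1/128 each; with n = 10 000 uniform observations, a leaf then contains about 10 000/128 ≈ 78 points on average, i.e. of order n/kn.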
Consistency
Theorem
Assume that the distribution of X has support on [0,1]^d. Then the random forests estimate r̄n is consistent whenever pnj log kn → ∞ for all j = 1, ..., d and kn/n → 0 as n → ∞.
In the purely random model, pnj = 1/d, independently of n and j, and consistency is ensured as long as kn → ∞ and kn/n → 0.
This is, however, a radically simplified version of the random forests used in practice.
A more in-depth analysis is needed.
Sparsity
There is empirical evidence that many signals in high-dimensional spaces admit a sparse representation.
Wavelet coefficients of images.
High-throughput technologies.
Sparse estimation is playing an increasingly important role in the statistics and machine learning communities.
Several methods relying on the notion of sparsity have recently been developed in both fields.
Our vision
The regression function r(X) = E[Y | X] depends in fact only on a nonempty subset S (for Strong) of the d features.
In other words, letting XS = (X^(j) : j ∈ S) and S = Card S, we have r(X) = E[Y | XS].
In the dimension reduction scenario we have in mind, the ambient dimension d can be very large, much larger than n.
As such, the value S characterizes the sparsity of the model: the smaller S, the sparser r.
Sparsity and random forests
Ideally, pnj = 1/S for j ∈ S.
To stick to reality, we will rather require that pnj = (1/S)(1 + ξnj).
Such a randomization mechanism may be designed on the basis of a test sample.
Action plan
\[
\mathbb E\,[\bar r_n(X) - r(X)]^2 = \mathbb E\,[\bar r_n(X) - \tilde r_n(X)]^2 + \mathbb E\,[\tilde r_n(X) - r(X)]^2,
\]
where
\[
\tilde r_n(X) = \sum_{i=1}^n \mathbb E_\Theta[W_{ni}(X,\Theta)]\, r(X_i)
\]
and Wni(X, Θ) is the weight that the tree built with Θ gives to observation i, so that r̄n(X) = Σi EΘ[Wni(X, Θ)] Yi. The cross term vanishes because, conditionally on X and X1, ..., Xn, the weights are fixed and E[Yi − r(Xi) | Xi] = 0.
Variance
Proposition
Assume that X is uniformly distributed on [0,1]^d and, for all x ∈ R^d,
σ²(x) = V[Y | X = x] ≤ σ²
for some positive constant σ². Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
\[
\mathbb E\,[\bar r_n(X) - \tilde r_n(X)]^2 \le C\sigma^2 \,\frac{S^2}{S-1}\,(1+\xi_n)\, \frac{k_n}{n(\log k_n)^{S/2d}},
\]
where
\[
C = \frac{288}{\pi}\left(\frac{\pi \log 2}{16}\right)^{S/2d}.
\]
Bias
Proposition
Assume that X is uniformly distributed on [0,1]^d and r is L-Lipschitz. Then, if pnj = (1/S)(1 + ξnj) for j ∈ S,
\[
\mathbb E\,[\tilde r_n(X) - r(X)]^2 \le \frac{2SL^2}{k_n^{\frac{0.75}{S\log 2}(1+\gamma_n)}} + \Big(\sup_{x\in[0,1]^d} r^2(x)\Big)\, e^{-n/2k_n}.
\]
The rate at which the bias decreases to 0 depends on the number of strong variables, not on d.
kn^{−(0.75/(S log 2))(1+γn)} = o(kn^{−2/d}) as soon as S ≤ 0.54 d.
The term e^{−n/2kn} prevents the extreme choice kn = n.
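Where the 0.54 d threshold comes from (a one-line check using only the exponents above, ignoring the (1 + γn) factor):
\[
k_n^{-0.75/(S\log 2)} = o\big(k_n^{-2/d}\big)
\iff \frac{0.75}{S\log 2} > \frac{2}{d}
\iff S < \frac{0.75\, d}{2\log 2} \approx 0.54\, d .
\]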
Main result
Theorem
If pnj = (1/S)(1 + ξnj) for j ∈ S, with ξnj log n → 0 as n → ∞, then for the choice
\[
k_n \propto \left(\frac{L^2}{\Xi}\right)^{\!1/\big(1+\frac{0.75}{S\log 2}\big)} n^{1/\big(1+\frac{0.75}{S\log 2}\big)},
\]
we have
\[
\limsup_{n\to\infty}\; \sup_{(X,Y)\in\mathcal F}\; \frac{\mathbb E\,[\bar r_n(X) - r(X)]^2}{\Xi^{\frac{0.75}{S\log 2+0.75}}\, L^{\frac{2S\log 2}{S\log 2+0.75}}\, n^{-\frac{0.75}{S\log 2+0.75}}} \le \Lambda.
\]
Take-home message
The rate n^{−0.75/(S log 2 + 0.75)} is strictly faster than the usual minimax rate n^{−2/(d+2)} as soon as S ≤ 0.54 d.
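A numerical illustration of the take-home message (simple arithmetic, not from the slides): with d = 20 ambient features of which S = 5 are strong, the forest rate is n^{−0.75/(5 log 2 + 0.75)} ≈ n^{−0.18}, whereas the minimax rate over the full ambient space is n^{−2/(d+2)} = n^{−2/22} ≈ n^{−0.09}; the exponent depends on S only.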
Discussion — Bagging
The optimal parameter kn depends on the unknown distribution of (X, Y).
To correct this situation, adaptive (data-dependent) choices of kn should preserve the rate of convergence of the estimate.
Another route we may follow is to analyze the effect of bagging.
Biau, Cérou and Guyader (2010).
Discussion — Choosing the pnj's
Imaginary scenario
The following splitting scheme is iteratively repeated at each node:
1 Select at random Mn candidate coordinates to split on.
2 If the selection is all weak, then choose one of them at random to split on.
3 If one or more strong variables are elected, choose one of them at random and cut.
Discussion — Choosing the pnj's
Each coordinate in S will be cut with the "ideal" probability
\[
p_n = \frac{1}{S}\left[1 - \Big(1 - \frac{S}{d}\Big)^{M_n}\right].
\]
The parameter Mn should satisfy
\[
\Big(1 - \frac{S}{d}\Big)^{M_n} \log n \to 0 \quad \text{as } n \to \infty.
\]
This is true as soon as
\[
M_n \to \infty \quad\text{and}\quad \frac{M_n}{\log n} \to \infty \quad \text{as } n \to \infty.
\]
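A tiny numerical check of the "ideal" probability (my own illustration): even a moderate Mn already brings pn close to 1/S.
```python
S, d = 5, 100                    # 5 strong features among d = 100
for M_n in (10, 50, 200):
    p_n = (1 / S) * (1 - (1 - S / d) ** M_n)
    print(M_n, round(p_n, 4))    # -> 0.0803, 0.1846, 0.2 (ideal value 1/S = 0.2)
```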
Assumptions
We have at hand an independent test set D′n, distributed as Dn.
The model is linear:
\[
Y = \sum_{j\in S} a_j X^{(j)} + \varepsilon.
\]
For a fixed node A = ∏_{j=1}^d Aj, fix a coordinate j and look at the weighted conditional variance V[Y | X^(j) ∈ Aj] P(X^(j) ∈ Aj).
If j ∈ S, then the best split is at the midpoint of the node, with a variance decrease equal to aj²/16 > 0.
If j ∈ W, the set of weak features, the decrease of the variance is always 0, whatever the location of the split.
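A quick check of the aj²/16 figure at the root node Aj = [0,1] (my own computation under the slide's assumptions, with X^(j) uniform and the other terms of the linear model unaffected by a split along coordinate j): since a uniform variable on an interval of length ℓ has variance ℓ²/12, splitting Aj at a point t leaves a weighted within-node variance contribution of aj²[t³ + (1−t)³]/12 for the j-th term. This is minimized at t = 1/2, and the decrease is
\[
a_j^2\,\frac{1}{12} \;-\; a_j^2\,\frac{(1/2)^3 + (1/2)^3}{12} \;=\; a_j^2\left(\frac{1}{12}-\frac{1}{48}\right) \;=\; \frac{a_j^2}{16}.
\]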
Discussion — Choosing the pnj's
Near-reality scenario
The following splitting scheme is iteratively repeated at each node:
1 Select at random Mn candidate coordinates to split on.
2 For each of the Mn elected coordinates, calculate the best split.
3 Select the coordinate which yields the largest decrease of the within-node sum of squares, and cut.
Conclusion
For j ∈ S,
\[
p_{nj} \approx \frac{1}{S}\,(1 + \xi_{nj}),
\]
where ξnj → 0 and satisfies the constraint ξnj log n → 0 as n tends to infinity, provided kn log n/n → 0, Mn → ∞ and Mn/log n → ∞.
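A rough sketch of one node of the near-reality scheme (my own code, not the author's; assumes Mn ≤ d): draw Mn candidate coordinates, compute for each the best empirical decrease in within-node sum of squares over all cut points, and keep the winner.
```python
import numpy as np

def best_split(X, Y, M_n, rng):
    """Return (coordinate, threshold, SSE decrease) for one node."""
    n, d = X.shape
    candidates = rng.choice(d, size=M_n, replace=False)   # M_n candidate coordinates
    total_sse = np.sum((Y - Y.mean()) ** 2)
    best = (None, None, -np.inf)
    for j in candidates:
        order = np.argsort(X[:, j])
        xj, yj = X[order, j], Y[order]
        for cut in range(1, n):                            # split between cut-1 and cut
            left, right = yj[:cut], yj[cut:]
            sse = np.sum((left - left.mean()) ** 2) + np.sum((right - right.mean()) ** 2)
            if total_sse - sse > best[2]:
                best = (j, (xj[cut - 1] + xj[cut]) / 2.0, total_sse - sse)
    return best
```
Repeating this at every node reproduces a CART-style split selection; the slides' conclusion is that, under the linear model above, strong coordinates end up being cut with probability close to 1/S.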
Outline
1 Setting
2 A random forests model
3 Layered nearest neighbors and random forests
Layered Nearest Neighbors
Definition
Let X1, ..., Xn be a sample of i.i.d. random vectors in R^d, d ≥ 2. An observation Xi is said to be a layered nearest neighbor (LNN) of a point x if the hyperrectangle defined by x and Xi contains no other data point.
[Figure: a point x and an observation Xi spanning an empty hyperrectangle.]
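A direct O(n² d) sketch of this definition (my own code): Xi is kept as an LNN of x if and only if no other sample point falls inside the hyperrectangle spanned by x and Xi. With a continuous distribution, ties on the rectangle boundary have probability zero, so closed inequalities are used for simplicity.
```python
import numpy as np

def layered_nearest_neighbours(x, X):
    """Indices i such that X[i] is a layered nearest neighbour of x."""
    lnn = []
    for i in range(len(X)):
        lo, hi = np.minimum(x, X[i]), np.maximum(x, X[i])   # rectangle spanned by x and X[i]
        inside = np.all((X >= lo) & (X <= hi), axis=1)      # sample points in that rectangle
        inside[i] = False                                   # ignore X[i] itself
        if not inside.any():                                # empty rectangle: X[i] is an LNN
            lnn.append(i)
    return np.array(lnn, dtype=int)
```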
What is known about Ln(x)?
... a lot when X1, ..., Xn are uniformly distributed over [0,1]^d (here Ln(x) denotes the number of LNN of x).
For example,
\[
\mathbb E\, L_n(x) = \frac{2^d(\log n)^{d-1}}{(d-1)!} + O\big((\log n)^{d-2}\big)
\]
and
\[
\frac{(d-1)!\, L_n(x)}{2^d(\log n)^{d-1}} \to 1 \quad \text{in probability as } n \to \infty.
\]
This is the problem of maxima in random vectors (Barndorff-Nielsen and Sobel, 1966).
Two results (Biau and Devroye, 2010)
Model
X1, ..., Xn are independently distributed according to some probability density f (with probability measure µ).
Theorem
For µ-almost all x ∈ R^d, one has Ln(x) → ∞ in probability as n → ∞.
Theorem
Suppose that f is λ-almost everywhere continuous. Then
\[
\frac{(d-1)!\,\mathbb E\, L_n(x)}{2^d(\log n)^{d-1}} \to 1 \quad \text{as } n \to \infty,
\]
at µ-almost all x ∈ R^d.
LNN regression estimation
Model
(X, Y), (X1, Y1), ..., (Xn, Yn) are i.i.d. random vectors of R^d × R. Moreover, |Y| is bounded and X has a density.
The regression function r(x) = E[Y | X = x] may be estimated by
\[
r_n(x) = \frac{1}{L_n(x)} \sum_{i=1}^n Y_i\, \mathbf 1_{[X_i \text{ is an LNN of } x]}.
\]
1 No smoothing parameter.
2 A scale-invariant estimate.
3 Intimately connected to Breiman's random forests.
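The estimate above then reads as a one-line average (my own sketch, reusing the hypothetical layered_nearest_neighbours helper from the previous snippet):
```python
import numpy as np

def lnn_regression(x, X, Y):
    """r_n(x): average of the Y_i over the layered nearest neighbours of x."""
    idx = layered_nearest_neighbours(x, X)              # from the previous sketch
    return float(Y[idx].mean()) if len(idx) else 0.0    # guard for the degenerate empty case
```
Nothing has to be tuned, and the set of LNN is unchanged by coordinatewise monotone rescaling of the data, which is what the three bullets above refer to.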
Consistency
Theorem (Pointwise Lp-consistency)
Assume that the regression function r is λ-almost everywhere continuous and that Y is bounded. Then, for µ-almost all x ∈ R^d and all p ≥ 1,
E|rn(x) − r(x)|^p → 0 as n → ∞.
Theorem (Global Lp-consistency)
Under the same conditions, for all p ≥ 1,
E|rn(X) − r(X)|^p → 0 as n → ∞.
1 No universal consistency result with respect to r is possible.
2 The results do not impose any condition on the density.
3 They are also scale-free.
Back to random forests
A random forest can be viewed as a weighted LNN regression estimate
\[
\bar r_n(x) = \sum_{i=1}^n Y_i\, W_{ni}(x),
\]
where the weights concentrate on the LNN and satisfy
\[
\sum_{i=1}^n W_{ni}(x) = 1.
\]
Non-adaptive strategies
Consider the non-adaptive random forests estimate
\[
\bar r_n(x) = \sum_{i=1}^n Y_i\, W_{ni}(x),
\]
where the weights concentrate on the LNN.
Proposition
For any x ∈ R^d, assume that σ² = V[Y | X = x] is independent of x. Then
\[
\mathbb E\,[\bar r_n(x) - r(x)]^2 \ge \frac{\sigma^2}{\mathbb E\, L_n(x)}.
\]
Rate of convergence
At µ-almost all x, when f is λ-almost everywhere continuous,
\[
\mathbb E\,[\bar r_n(x) - r(x)]^2 \gtrsim \frac{\sigma^2 (d-1)!}{2^d(\log n)^{d-1}}.
\]
Improving the rate of convergence
1 Stop as soon as a future rectangle split would cause a sub-rectangle to have fewer than kn points.
2 Resort to bagging and randomize using random subsamples.