Random Forests: A Not-So-Thorough Introduction

@zgmfx20a
2011/4/23 Osaka.R #5
Lockon Co., Ltd. seminar room

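Introduction
• Self-introduction

• Requests
– This talk is given in a personal capacity
– It surely contains many mistakes; I would be grateful if you pointed them out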

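Random Forests
• An ensemble learning method
– Random sampling / bagging / decision trees and regression trees
• Supervised and unsupervised classification, and regression
• Visualization of similarity through the proximity matrix it generates
• Cross-validation-style error estimate from OOB (Out-Of-Bag) samples
• Variable selection by importance
• http://www.stat.berkeley.edu/~breiman/
• The author and colleagues applied it to amino acid sequence analysis in 2004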

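Background to the birth of Random Forests
(from the 2009 meeting of the Biometric Society of Japan)
• L. Breiman retired from the university and took up consulting work
• However, conventional statistical methods did not fit the interpretation of real-world data well
• On closer inspection, the data almost never followed a parametric distribution
• After much effort, he completed Random Forests in 2001
– L. Breiman was born in 1928, which means …
• Incidentally, he died in 2005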

Random Forest or Forests?
• The original paper uses "Random Forests"
– http://www.stat.berkeley.edu/~breiman/RandomForests/

• Google: R. Forest > R. Forests
• Google Scholar: R. Forest < R. Forests
• PubMed: R. Forest > R. Forests
• Is "Random Forests" becoming more common lately?
– A certain second edition switched to "Random Forests"
• Of course, that is not all that was revised … ^_^;;

Discrimination for iris species
• iris: 4 explanatory variables, 3 species
• Figure: 4D scatter plot

    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1            5.1         3.5          1.4         0.2 setosa
2            4.9         3.0          1.4         0.2 setosa
3            4.7         3.2          1.3         0.2 setosa
4            4.6         3.1          1.5         0.2 setosa
5            5.0         3.6          1.4         0.2 setosa
…
51           7.0         3.2          4.7         1.4 versicolor
52           6.4         3.2          4.5         1.5 versicolor
53           6.9         3.1          4.9         1.5 versicolor
54           5.5         2.3          4.0         1.3 versicolor
55           6.5         2.8          4.6         1.5 versicolor
…
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica

Make decision tree
• Figure: scatter plot with the Setosa, Versicolor and Virginica regions

Decision tree – separating Setosa
• Petal.Length < 2.6 → Setosa
• otherwise → Other

Decision tree – separating Virginica
• Petal.Length < 2.6 → Setosa
• otherwise, Petal.Width > 1.7 → Virginica
• otherwise → Other

Decision tree – separating Versicolor
• Petal.Length < 2.6 → Setosa
• otherwise, Petal.Width > 1.7 → Virginica
• otherwise, Petal.Length < 5.2 → Versicolor; else → Virginica
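
The three splits on the decision tree slides can be written out by hand and checked against the iris data. This is only a sketch: `classify_iris` is a hypothetical helper name, and the yes/no direction of each split is an assumption read off the slide layout.

```r
# Hand-coded version of the three splits shown on the slides.
# classify_iris is a hypothetical helper, not part of any package.
classify_iris <- function(df) {
  pred <- ifelse(df$Petal.Length < 2.6, "setosa",
          ifelse(df$Petal.Width > 1.7, "virginica",
          ifelse(df$Petal.Length < 5.2, "versicolor", "virginica")))
  factor(pred, levels = levels(iris$Species))
}

pred <- classify_iris(iris)

# Cross-tabulate predicted against true species
confusion <- table(predicted = pred, actual = iris$Species)
print(confusion)

# Fraction of the 150 flowers that this single tree gets right
accuracy <- mean(pred == iris$Species)
print(accuracy)
```

A single tree like this is evaluated on the same data it was designed for, so its accuracy is optimistic compared with the OOB estimate a forest reports.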

randomForest package / CRAN
• randomForest
– explanatory variables [, response variable, parameters, switches]
• print.randomForest
• MDSplot
• plot.randomForest
• varImpPlot / importance.randomForest

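The randomForest function
• Give the explanatory variables and the response variable separately
– Invokes the randomForest.default method
• Used when the first argument is not a formula
• Use a model formula with a data frame
– column ~ . (e.g. Species ~ .)
• That column is the response variable / all others are explanatory variables
– Invokes the randomForest.formula method
• Builds the explanatory variables and the response, then calls randomForest.default
• Processing branches on the class of the response variable
– No response variable: unsupervised classification
– Response is a factor: supervised classification
• Otherwise: regression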
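
The branching described above can be mimicked in a few lines of base R. This is a simplified sketch, not the package source: `rf_task_type` is a hypothetical helper, and the `model.frame` step only approximates how a formula is turned into separate x and y objects.

```r
# Hypothetical helper that mirrors the branching on the response variable
rf_task_type <- function(y = NULL) {
  if (is.null(y)) {
    "unsupervised"
  } else if (is.factor(y)) {
    "classification"
  } else {
    "regression"
  }
}

rf_task_type()                    # no response supplied
rf_task_type(iris$Species)        # factor response
rf_task_type(iris$Sepal.Length)   # numeric response

# Splitting a formula plus data frame into response and predictors
mf <- model.frame(Species ~ ., data = iris)
y <- model.response(mf)           # the column on the left of "~"
x <- mf[, -1, drop = FALSE]       # everything else
```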

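mtry and ntree
• mtry
– Number of variables randomly sampled as split candidates at each node
– Default: sqrt(p) for classification, p/3 for regression (p = number of explanatory variables)
mtry = if (!is.null(y) && !is.factor(y))
         max(floor(ncol(x)/3), 1)
       else floor(sqrt(ncol(x)))

• ntree
– Number of decision trees or regression trees to grow
– Default is 500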
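
The default rule for mtry can be tried out on its own. In this small sketch `default_mtry` is a hypothetical wrapper around the expression quoted above, and the value 134 is simply an assumed predictor count chosen for illustration.

```r
# Hypothetical wrapper around the default mtry rule quoted on the slide.
# p is the number of explanatory variables.
default_mtry <- function(p, classification = TRUE) {
  if (classification) {
    floor(sqrt(p))          # classification (and unsupervised) case
  } else {
    max(floor(p / 3), 1)    # regression case, never below 1
  }
}

default_mtry(ncol(iris) - 1)                # 4 predictors in iris
default_mtry(134, classification = FALSE)   # assumed 134 predictors
default_mtry(2, classification = FALSE)     # floor(2/3) is 0, so 1 is used
```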

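How to supply the training data
• Put the explanatory variables and the response variable in the columns
• Supply them either separately or together in one data frame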
> set.seed(123)
> iris[sample(1:150,10),]
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
44 5.0 3.5 1.6 0.6 setosa
118 7.7 3.8 6.7 2.2 virginica
61 5.0 2.0 3.5 1.0 versicolor
130 7.2 3.0 5.8 1.6 virginica
138 6.4 3.1 5.5 1.8 virginica
7 4.6 3.4 1.4 0.3 setosa
77 6.8 2.8 4.8 1.4 versicolor
128 6.1 3.0 4.9 1.8 virginica
79 6.0 2.9 4.5 1.5 versicolor
65 5.6 2.9 3.6 1.3 versicolor

Random Forests: unsupervised classification
• No response variable is given
• Visualize the similarity with MDSplot

> library(randomForest)
> iris.urf = randomForest(iris[,-5], ntree=200)
> iris.urf

Call:
randomForest(x = iris[, -5], ntree = 200)
Type of random forest: unsupervised
Number of trees: 200
No. of variables tried at each split: 2

> sqrt(ncol(iris[,-5]))
[1] 2

Random Forests: supervised classification
• Grow many weak decision trees and take a majority vote
> class(iris$Species)
[1] "factor"
> (iris.crf = randomForest(Species ~ ., data=iris))

Call:
randomForest(formula = Species ~ ., data = iris)
Type of random forest: classification

OOB estimate of error rate: 4.67%
Confusion matrix:
           setosa versicolor virginica class.error
setosa         50          0         0        0.00
versicolor      0         47         3        0.06
virginica       0          4        46        0.08
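
The majority vote itself is easy to illustrate without growing any trees. The votes below are made up for illustration, and `majority_vote` is a hypothetical helper rather than a function of the randomForest package.

```r
# Each column holds the class predicted by one (hypothetical) tree,
# each row is one sample.
tree_votes <- cbind(
  tree1 = c("setosa", "versicolor", "virginica"),
  tree2 = c("setosa", "virginica",  "virginica"),
  tree3 = c("setosa", "versicolor", "versicolor")
)

# Pick the most frequent class per row.
# Ties go to the class that comes first alphabetically.
majority_vote <- function(votes) {
  apply(votes, 1, function(v) names(which.max(table(v))))
}

forest_pred <- majority_vote(tree_votes)
print(forest_pred)
```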

Random Forests: regression
• Grow many weak regression trees and average their predictions
> data(BloodBrain, package="caret")
> class(logBBB)
[1] "numeric"
> (bbbDescr.rrf <- randomForest(bbbDescr, logBBB))

Call:
randomForest(x = bbbDescr, y = logBBB)
Type of random forest: regression

Mean of squared residuals: 0.2632423
% Var explained: 56.45
> max(floor(ncol(bbbDescr)/3), 1)
[1] 44

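Types of learning and importance output
• Unsupervised classification
– MeanDecreaseGini
• Supervised classification
– MeanDecreaseAccuracy (with importance=TRUE)
• For multi-class problems, per-class importance is added
– MeanDecreaseGini
• Regression
– %IncMSE (with importance=TRUE) and IncNodePurity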
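MDSplot
• Visualizes the proximity matrix generated during classification learning
– Stored in the proximity component of the fitted object
• A matrix of (number of samples) x (number of samples)
– Reduced in dimension by multidimensional scaling
• cmdscale(1 - rf$proximity, eig = TRUE, k = k)
• The proximity matrix is generated for classification
– Supervised classification requires proximity=TRUE
• MDSplot visualizes the result of classification learning
– Create the legend with the legend function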
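
To make the proximity idea concrete, here is a toy sketch in base R. It assumes the usual definition (the fraction of trees in which two samples end up in the same terminal node); the terminal-node IDs are invented, and the package may normalize slightly differently.

```r
# Hypothetical terminal-node IDs: rows are 6 samples, columns are 4 trees
leaf <- cbind(
  t1 = c(1, 1, 1, 2, 2, 2),
  t2 = c(1, 1, 2, 2, 3, 3),
  t3 = c(1, 2, 1, 3, 3, 3),
  t4 = c(1, 1, 1, 2, 3, 2)
)

n <- nrow(leaf)
prox <- matrix(0, n, n)
for (j in seq_len(ncol(leaf))) {
  # TRUE where two samples share a terminal node in tree j
  prox <- prox + outer(leaf[, j], leaf[, j], "==")
}
prox <- prox / ncol(leaf)   # fraction of trees with a shared node

# Classical MDS on the dissimilarity 1 - proximity, as in the slide
mds <- cmdscale(1 - prox, eig = TRUE, k = 2)
print(round(prox, 2))
print(mds$points)
```

The rows of `mds$points` are the low-dimensional coordinates that a plot such as MDSplot would draw.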

Difference between cmdscale and MDSplot

> plot(cmdscale(dist(iris[,-5])),
+      cex=2,
+      pch=as.numeric(iris[,5]))

> iris.rf = randomForest(iris[,-5])
> MDSplot(iris.rf, fac=iris[,5],
+         cex=2,
+         pch=as.numeric(iris[,5]))

MDSggplot ☺
• MDSplot implemented with ggplot2
– The legend is generated automatically

makeMDSggplot <- function (rf, fac, pch = 1, size = 2) {
  library(ggplot2)
  n <- nlevels(fac)
  m <- pch + n - 1   # last shape code, so that each class gets its own shape
  # classical MDS on the dissimilarity 1 - proximity
  rf.mds <- stats::cmdscale(1 - rf$proximity, eig = TRUE, k = 2)
  rf.mds.ggplot <- as.data.frame(rf.mds$points)
  rf.mds.ggplot$class <- fac
  colnames(rf.mds.ggplot) <- c("X", "Y", "class")
  # a single manual shape scale; a second shape scale would only be overridden
  p <- ggplot(data = rf.mds.ggplot, aes(X, Y)) +
    geom_point(aes(shape = class), size = size) +
    scale_shape_manual(values = pch:m)
  p + labs(x = "Dim 1", y = "Dim 2")
}

> iris.crf <- randomForest(iris[,-5], iris[,5], proximity=TRUE)
> makeMDSggplot(iris.crf, iris$Species, size=5)

OOB (Out-of-bag)
• An error estimate that stands in for N-fold cross-validation
– The error levels off at around ntree = 120
> library(caret)
> data(mdrr)
> mdrr.crf = randomForest(mdrrDescr,
+ mdrrClass)
> mdrr.crf
…
OOB estimate of error rate: 16.48%
…
> dim(mdrr.crf$err.rate)
[1] 500 3
> colnames(mdrr.crf$err.rate)
[1] "OOB" "Active" "Inactive"
> plot(mdrr.crf)
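
The reason OOB samples can stand in for a held-out set is that each bootstrap sample leaves out roughly a third of the rows. A small base R simulation illustrates this; the sample size of 150 and the 500 repetitions are arbitrary choices for the sketch.

```r
set.seed(42)
n <- 150       # number of rows, e.g. the size of iris
ntree <- 500   # one bootstrap sample per tree

# For each bootstrap sample, compute the fraction of rows never drawn
oob_fraction <- replicate(ntree, {
  in_bag <- sample(n, n, replace = TRUE)
  mean(!(seq_len(n) %in% in_bag))
})

print(mean(oob_fraction))   # average out-of-bag fraction
print(exp(-1))              # theoretical limit for large n
```

Each row is therefore out of bag for a sizeable share of the trees, and only those trees vote when its OOB prediction is formed.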

Importance output and interpretation
• Petal.Length and Petal.Width have high importance
– They are the important variables for the classification
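Cross-validation function rfcv
randomForest package 4.6-1 and later
• Computes the cross-validated error rate while reducing the number of predictors step by step (mtry is recomputed at each step)
Figure panels: iris data set + 508 dummy variables; mdrr data set (caret)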

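Implementation of the randomForest package
• An R wrapper around the original Fortran code, with additional functions of its own
• Limits from that era (?) still remain as they were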
$ tar tvfz randomForest_4.6-2.tar.gz
-rw-r--r-- hornik/users 17405 2010-12-20 22:48 randomForest/src/classTree.c
-rw-r--r-- hornik/users 14605 2010-12-20 22:48 randomForest/src/regrf.c
-rw-r--r-- hornik/users 10187 2010-12-20 22:48 randomForest/src/regTree.c
-rw-r--r-- hornik/users 23818 2010-12-20 22:48 randomForest/src/rf.c
-rw-r--r-- hornik/users 4708 2010-12-20 22:48 randomForest/src/rf.h
-rw-r--r-- hornik/users 16275 2010-12-20 22:48 randomForest/src/rfsub.f
-rw-r--r-- hornik/users 9201 2010-12-20 22:48 randomForest/src/rfutils.c

> getS3method("randomForest", "default")
…
if (maxcat > 32)
    stop("Can not handle categorical predictors with more than 32 categories.")
$ more rfsub.f
…
double precision tclasspop(nclass), classpop(nclass, nrnodes),
1 tclasscat(nclass, 32), win(nsample), wr(nclass),
1 wl(nclass), tgini(mdim), xrand
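
A quick pre-check against the 32-category limit quoted above might look like this. `max_factor_levels` is a hypothetical helper and the two data frames are made up; the threshold is the one in the source listing for version 4.6-2.

```r
# Hypothetical helper: largest number of levels among factor columns
max_factor_levels <- function(x) {
  n_levels <- vapply(
    x,
    function(col) if (is.factor(col)) nlevels(col) else 0L,
    integer(1)
  )
  max(n_levels)
}

# Made-up examples: 26 levels is fine, 40 levels would trigger the stop()
small <- data.frame(g = factor(rep(letters, length.out = 100)), v = seq_len(100))
large <- data.frame(g = factor(seq_len(40)), v = seq_len(40))

max_factor_levels(small) > 32
max_factor_levels(large) > 32
```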

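Trying the compiler package
• The RF functions do not seem to benefit much, presumably because most of the work happens in compiled C/Fortran code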

randomForest 4.6-2
> library(compiler)
> library(caret)
> data(mdrr)
> rfcvbc = cmpfun(rfcv)
> system.time(r1 <- rfcv(mdrrDescr, mdrrClass))
   user  system elapsed
 53.583   0.176  54.486
> system.time(r2 <- rfcvbc(mdrrDescr, mdrrClass))
   user  system elapsed
52.907 0.168 53.075
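
For contrast, `cmpfun` is more likely to help a function that spends its time in interpreted R loops. The sketch below uses a made-up function `slow_sum`. On recent R versions the JIT compiler is enabled by default, so the measured difference may be negligible there.

```r
library(compiler)

# Made-up example: a pure R loop with no compiled code underneath
slow_sum <- function(n) {
  total <- 0
  for (i in seq_len(n)) total <- total + i * i
  total
}

# Byte-compiled copy of the same function
fast_sum <- cmpfun(slow_sum)

print(system.time(slow_sum(1e6)))
print(system.time(fast_sum(1e6)))
```

Both versions return the same value; only the time spent in the interpreter can differ.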

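Random Forests
at useR! 2008 (Dortmund)
• There were many RF talks at useR! 2008
– Introduced as a method for high-dimensional data analysis (Lasso was the main one)
– eQTL analysis
– Metabolome analysis
• The session chair at the time was

– eNose (odor discrimination)
– A discussion of importance (conditional inference forests)
• It was more than a user meeting for a single application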

Other related papers
• Random Survival Forests (randomSurvivalForest package)
• Conditional inference forests (cforest function in the party package)
– Strobl C et al, Conditional variable importance for random forests, BMC Bioinformatics. 2008 Jul 11;9:307
• RF++ (http://sourceforge.net/projects/rfpp/)
– Karpievitch YV et al, An Introspective Comparison of Random Forest-Based Classifiers for the Analysis of Cluster-Correlated Data by Way of RF++, PLoS One. 2009 Sep 18;4(9):e7087
• Semi supervised RF
– Teramoto R, Prediction of Alzheimer's diagnosis using semi-supervised distance metric learning with label propagation, Comput Biol Chem. 2008 Dec;32(6):438-41.
• Logic Forest (LogicForest package)
– Wolf BJ et al, Logic Forest: an ensemble classifier for discovering logical combinations of binary markers, Bioinformatics. 2010 Sep 1;26(17):2183-9.

References
• Breiman L, Random forests. Machine Learning 2001, 45:5-32.
• Liaw A and Wiener M, Classification and Regression by randomForest, R News (2002) Vol. 2/3
• Hastie et al, The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Second Edition (2009), 978-0387848570
• Eguchi S (江口真透), New statistical and machine learning methods for analyzing genome and omics data (in Japanese), Biometric Society of Japan (2009)
• Jin M (金明哲), Rによるデータサイエンス (Data Science with R), Morikita Publishing (2007), 978-4627096011
• Okada et al (岡田他), Rパッケージガイドブック (R Package Guidebook), Tokyo Tosho (2011), 978-4489020971

Thank you for your attention
