4. 4
• Notation
• Input
• Vector: 𝑋
• j-th component of a vector: X_j
• i-th observation: x_i (lowercase)
• Matrix: 𝐗 (bold)
• All the observations on the j-th variable: 𝐱_j (bold)
• Output
• Quantitative output: 𝑌
• Prediction of 𝑌: Ŷ
• Qualitative output: 𝐺
• Prediction of 𝐺: Ĝ
2.2 Variable Types and Terminology (contd.)
5. 5
• Linear Model
• With the intercept (bias) included in the coefficient vector and a constant 1 in the input: Ŷ = X^T β̂
• Most popular fitting method: least squares
• RSS(β) = (𝐲 − 𝐗β)^T (𝐲 − 𝐗β)
(RSS: residual sum of squares)
• Differentiating RSS w.r.t. β and setting it to 0:
• 𝐗^T (𝐲 − 𝐗β) = 0
• If 𝐗^T 𝐗 is nonsingular (a regular matrix), the inverse exists, and
• β̂ = (𝐗^T 𝐗)^{-1} 𝐗^T 𝐲 (see the numeric sketch below)
2.3.1 Linear Models and Least Squares
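A minimal numpy sketch of this closed-form least-squares fit; the synthetic data, sample sizes, and variable names are illustrative assumptions, not from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data: N observations, p inputs, plus a constant column for the bias term.
N, p = 100, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, p))])
beta_true = np.array([1.0, 2.0, -1.0, 0.5])
y = X @ beta_true + rng.normal(scale=0.1, size=N)

# Least squares: beta_hat = (X^T X)^{-1} X^T y.
# Numerically, solve the normal equations instead of forming the inverse.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

rss = np.sum((y - X @ beta_hat) ** 2)   # RSS(beta_hat)
print(beta_hat, rss)
```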
6. 6
• Linear Model (Classification)
• Ĝ = ORANGE if Ŷ > 0.5; BLUE if Ŷ ≤ 0.5
• The two classes are separated by the decision boundary
• {x : x^T β̂ = 0.5}
• Two scenarios for generating the 2-class data
1. Each class is generated from an uncorrelated bivariate Gaussian with a different mean
⇒ a linear decision boundary is optimal (Chapter 4)
2. Each class is generated from a mixture of 10 low-variance Gaussians, whose means are themselves drawn from a Gaussian distribution
⇒ a nonlinear decision boundary is optimal (this is the scenario used in this chapter's example)
2.3.1 Linear Models and Least Squares (contd.)
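A small sketch of using the least-squares fit as a classifier, with BLUE coded 0 and ORANGE coded 1; the two Gaussian clouds and their means are illustrative assumptions standing in for the textbook data.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative two-class data: BLUE (coded 0) and ORANGE (coded 1), Gaussian clouds.
X_blue = rng.normal(loc=[-1.0, -1.0], size=(100, 2))
X_orange = rng.normal(loc=[1.0, 1.0], size=(100, 2))
X = np.vstack([X_blue, X_orange])
y = np.r_[np.zeros(100), np.ones(100)]

# Fit Y on X (with intercept) by least squares, then threshold Y_hat at 0.5.
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
beta_hat = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)
y_hat = Xb @ beta_hat
G_hat = np.where(y_hat > 0.5, "ORANGE", "BLUE")

# The decision boundary is the set {x : x^T beta_hat = 0.5}, a line in 2-D.
true_labels = np.where(y > 0.5, "ORANGE", "BLUE")
print((G_hat == true_labels).mean())   # training accuracy
```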
7. 7
• k-Nearest Neighbor
• Ŷ(x) = (1/k) Σ_{x_i ∈ N_k(x)} y_i
where N_k(x) is the set of the k points closest to x (in Euclidean distance) in the training set
• 𝑘 = 1: Voronoi tessellation
• Notice
• Effective number of parameters of k-NN = N/k
• “we will see”
• RSS on the training data is useless for choosing k
• With k = 1 the training data are fit with zero error, so k = 1 always attains the smallest RSS
2.3.2 Nearest-Neighbor Methods
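A minimal sketch of the k-NN average using a brute-force neighbor search; the training data and function names here are illustrative.

```python
import numpy as np

def knn_predict(x0, X_train, y_train, k):
    """k-NN regression: average the y of the k Euclidean-closest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)
    neighbors = np.argsort(dists)[:k]          # indices of N_k(x0)
    return y_train[neighbors].mean()

rng = np.random.default_rng(0)
X_train = rng.uniform(-1, 1, size=(200, 2))
y_train = np.sin(3 * X_train[:, 0]) + 0.1 * rng.normal(size=200)

print(knn_predict(np.array([0.2, -0.4]), X_train, y_train, k=15))
```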
8. 8
• Many of today's most popular techniques are variants of these two simple procedures: the linear model and k-nearest neighbors (or hybrids of both)
2.3.3 From Least Squares to Nearest Neighbors
                     Variance   Bias
Linear Model         low        high
k-Nearest Neighbors  high       low
10. 10
• The minimizing f is the regression function
• The best prediction of 𝑌 at any point 𝑋 = 𝑥 is the conditional mean,
when best is measured by average squared error.
• f(x) = argmin_c E_{Y|X}[(Y − c)² | X = x]
⇒ ∂/∂f E_{Y|X}[(Y − f(X))² | X = x] = 0
⇒ ∂/∂f ∫ (y − f(x))² Pr(y|x) dy = 0
⇒ ∫ (−2y + 2f(x)) Pr(y|x) dy = 0
⇒ 2 f(x) ∫ Pr(y|x) dy = 2 ∫ y Pr(y|x) dy
⇒ f(x) = E(Y | X = x)
2.4 Statistical Decision Theory (contd.)
11. 11
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• k-Nearest Neighbor
• f̂(x) = Ave(y_i | x_i ∈ N_k(x))
• Two approximations: the expectation is replaced by a sample average, and conditioning at a point is relaxed to conditioning on the neighborhood N_k(x)
• Under mild regularity conditions on Pr(X, Y):
• if N, k → ∞ with k/N → 0, then f̂(x) → E(Y | X = x)
• However, the curse of dimensionality becomes severe
2.4 Statistical Decision Theory (contd.)
12. 12
• How to estimate the conditional mean E(𝑌|𝑋 = 𝑥)
• Linear Regression
• Assume f(x) ≈ x^T β (or exactly f(x) = x^T β?)
• Then,
• ∂EPE/∂β = ∂/∂β ∫ (y − x^T β)² Pr(x, y) dx dy
= ∫ 2 (y − x^T β)(−x) Pr(x, y) dx dy
= −2 ∫ (y − x^T β) x Pr(x, y) dx dy
= −2 ∫ (yx − xx^T β) Pr(x, y) dx dy
⇒ ∫ yx Pr(x, y) dx dy = ∫ xx^T β Pr(x, y) dx dy
⇒ β = E(XX^T)^{-1} E(XY)
• This is not conditioned on X.
• Based on the L1 loss function, EPE(f) = E|Y − f(X)|,
• the solution is f(x) = median(Y | X = x) (see the numeric check below)
2.4 Statistical Decision Theory (contd.)
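A quick numeric check, on an illustrative skewed sample, that the average squared error is minimized by the mean while the average absolute (L1) error is minimized by the median.

```python
import numpy as np

rng = np.random.default_rng(0)
y = rng.exponential(size=10_000)            # skewed sample, so mean != median

c_grid = np.linspace(0.0, 3.0, 3001)
l2 = [np.mean((y - c) ** 2) for c in c_grid]    # average squared error at constant c
l1 = [np.mean(np.abs(y - c)) for c in c_grid]   # average absolute error at constant c

print(c_grid[np.argmin(l2)], y.mean())      # ~ equal: the mean minimizes squared error
print(c_grid[np.argmin(l1)], np.median(y))  # ~ equal: the median minimizes absolute error
```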
13. 13
• In classification
• The loss function L is represented by a K × K matrix 𝐋, zero on the diagonal and nonnegative elsewhere, where K = card(ℊ)
• With the zero-one loss function, every off-diagonal entry is 1: each misclassification is charged a single unit
• The expected prediction error:
• EPE(Ĝ) = E[L(G, Ĝ(X))] = E_X Σ_{k=1}^{K} L(ℊ_k, Ĝ(X)) Pr(ℊ_k | X)
2.4 Statistical Decision Theory (contd.)
14. 14
• In classification
• The minimizing Ĝ (pointwise, at each X = x) is the Bayes classifier.
• Ĝ(x) = argmin_{g ∈ ℊ} Σ_{k=1}^{K} L(ℊ_k, g) Pr(ℊ_k | X = x)
= argmin_{g ∈ ℊ} [1 − Pr(g | X = x)]   (with zero-one loss)
= ℊ_k if Pr(ℊ_k | X = x) = max_{g ∈ ℊ} Pr(g | X = x)
• This classifies to the most probable class, using the
conditional distribution Pr(𝐺|𝑋).
• Many approaches to modeling Pr 𝐺 𝑋 are discussed in Ch.4.
2.4 Statistical Decision Theory (contd.)
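A tiny sketch of the Bayes classifier for a case where the class posteriors can be computed exactly; the two-class generative model (isotropic Gaussians with equal priors) is an illustrative assumption, not the textbook's simulation.

```python
import numpy as np

# Illustrative generative model: two classes with known isotropic Gaussian densities.
priors = {"BLUE": 0.5, "ORANGE": 0.5}
means = {"BLUE": np.array([-1.0, -1.0]), "ORANGE": np.array([1.0, 1.0])}

def gauss_pdf(x, mean):
    """Density of N(mean, I) in two dimensions."""
    d = x - mean
    return np.exp(-0.5 * d @ d) / (2 * np.pi)

def bayes_classify(x):
    """Under zero-one loss, pick the class with the largest posterior Pr(g | X = x)."""
    joint = {g: priors[g] * gauss_pdf(np.asarray(x), means[g]) for g in priors}
    return max(joint, key=joint.get)   # the normalizing constant does not change the argmax

print(bayes_classify([0.3, 0.8]))      # -> ORANGE
```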
15. 15
• The curse of dimensionality
1. To capture 10% of the data in a hypercubical neighborhood in 10 dimensions, the expected edge length of the cube is e_10(0.1) = 0.1^{1/10} ≈ 0.80, i.e. the neighborhood must cover 80% of the range of each input variable
2. Consider a nearest-neighbor estimate at the origin, with N data points uniformly distributed in the p-dimensional unit ball
• The median distance from the origin to the closest data point is
• d(p, N) = (1 − (1/2)^{1/N})^{1/p}
• If N = 500 and p = 10, then d(p, N) ≈ 0.52 (checked numerically below)
• i.e. the median nearest neighbor is already more than halfway to the boundary; most data points are closer to the boundary of the sample space than to any other data point
2.5 Local Methods in High Dimensions
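The formula can be evaluated directly; a short check reproducing the d(10, 500) ≈ 0.52 figure and showing how it grows with dimension.

```python
import numpy as np

def median_nn_distance(p, N):
    """Median distance from the origin to the closest of N points
    uniform in the p-dimensional unit ball: (1 - (1/2)^(1/N))^(1/p)."""
    return (1.0 - 0.5 ** (1.0 / N)) ** (1.0 / p)

for p in (1, 2, 5, 10):
    print(p, round(median_nn_distance(p, 500), 3))
# p = 10 gives ~0.52: the median nearest neighbor is more than
# halfway to the boundary of the unit ball.
```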
16. 16
• The curse of dimensionality
3. The sampling density is proportional to N^{1/p}
• If N_1 = 100 is a dense sample for a single input, then N_10 = 100^{10} points are needed for the same sampling density with 10 inputs
• Sparseness in high dimensions
4. Draw the examples x_i uniformly from [−1, 1]^p
• Assume Y = f(X) = exp(−8 ||X||²)
• Use 1-nearest-neighbor estimation at x_0 = 0
• The estimate f̂(x_0) is biased downward: it is below f(0) = 1 whenever the nearest neighbor is not exactly at 0
• As the dimension increases, the nearest neighbor moves further from the target point, so the bias grows (simulated below)
2.5 Local Methods in High Dimensions (contd.)
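A small simulation of this setup; the sample size (1000 points per draw) and the number of repetitions are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def one_nn_estimate_at_origin(p, N=1000):
    """Draw N points uniform on [-1, 1]^p and return f at the point nearest the origin."""
    X = rng.uniform(-1.0, 1.0, size=(N, p))
    f = np.exp(-8.0 * np.sum(X ** 2, axis=1))    # f(x) = exp(-8 ||x||^2)
    nearest = np.argmin(np.linalg.norm(X, axis=1))
    return f[nearest]                             # 1-NN estimate of f(0) = 1

for p in (1, 2, 5, 10):
    est = np.mean([one_nn_estimate_at_origin(p) for _ in range(200)])
    print(p, round(est, 3))   # drifts toward 0 as p grows: the neighbor moves away from 0
```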
17. 17
• The curse of dimensionality
5. For a linear model Y = X^T β + ε, ε ~ N(0, σ²)
• For an arbitrary test point x_0,
• EPE(x_0) = E_{y_0|x_0} E_T (y_0 − ŷ_0)²
= σ² + E_T[x_0^T (𝐗^T 𝐗)^{-1} x_0] σ² + 0²   (irreducible variance + variance of the estimate + zero bias)
• If N is large, T is selected at random, and E(X) = 0, then
E_{x_0} EPE(x_0) ≈ σ² (p/N) + σ²
• The expected EPE grows only linearly in p, with slope σ²/N; if N is large and/or σ² is small this growth is negligible
⇒ under this restriction (a correctly specified linear model) the curse of dimensionality is avoided
2.5 Local Methods in High Dimensions (contd.)
18. 18
• Additive model
• Y = f(X) + ε
• Deterministic part: f(x) = E(Y | X = x)
• Anything non-deterministic goes into the random error ε
• E(ε) = 0
• ε is independent of X
• The additive error model is typically not used for classification (qualitative outputs)
• There the target function is p(X) = Pr(G|X), the conditional probabilities, modeled directly
2.6.1 A Statistical Model for the Joint Distribution Pr(𝑋, 𝑌)
19. 19
• Learn f(X) by example through a teacher
• The training set consists of input-output pairs
• T = {(x_i, y_i), i = 1, …, N}
• Learning by example:
1. Produce f̂(x_i)
2. Compute the differences y_i − f̂(x_i)
3. Modify f̂ accordingly
※ Note: this idea has been in use throughout the preceding sections, so why is it only spelled out here?
2.6.2 Supervised Learning
20. 20
• Each data point (x_i, y_i) is viewed as a point in a (p + 1)-dimensional Euclidean space
• The approximating function f_θ(x) has a set of parameters θ
• Linear model
• Linear basis expansions: f_θ(x) = Σ_{k=1}^{K} h_k(x) θ_k
• Criterion for approximation
1. The residual sum-of-squares
• RSS(θ) = Σ_{i=1}^{N} (y_i − f_θ(x_i))²
• For the linear model this has a simple closed-form solution
2.6.3 Function Approximation
21. 21
• Criterion for approximation
2. Maximum likelihood estimation
• L(θ) = Σ_{i=1}^{N} log Pr_θ(y_i)
• The principle of maximum likelihood:
• the most reasonable values of θ are those for which the probability of the observed sample is largest
• In classification, use the cross-entropy (log-likelihood) with Pr(G = ℊ_k | X = x) = p_{k,θ}(x)
• L(θ) = Σ_{i=1}^{N} log p_{g_i,θ}(x_i)
2.6.3 Function Approximation (contd.)
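A small sketch of evaluating this classification log-likelihood for a two-class case where p_{k,θ}(x) is given by a logistic-style linear function; that model form, and the synthetic data, are illustrative assumptions rather than something prescribed by the slide.

```python
import numpy as np

def log_likelihood(theta, X, g):
    """Sum of log p_{g_i, theta}(x_i) for a two-class logistic-style model.

    g is coded 0/1; p_{1,theta}(x) = sigmoid(x^T theta), p_{0,theta}(x) = 1 - p_{1,theta}(x).
    """
    p1 = 1.0 / (1.0 + np.exp(-X @ theta))
    p_of_observed_class = np.where(g == 1, p1, 1.0 - p1)
    return np.sum(np.log(p_of_observed_class))

rng = np.random.default_rng(0)
X = np.hstack([np.ones((200, 1)), rng.normal(size=(200, 2))])
g = (X @ np.array([0.5, 2.0, -1.0]) + rng.logistic(size=200) > 0).astype(int)

# Maximum likelihood chooses the theta with the largest log-likelihood;
# here we simply compare two candidate parameter vectors.
print(log_likelihood(np.array([0.5, 2.0, -1.0]), X, g),
      log_likelihood(np.zeros(3), X, g))
```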
22. 22
• Infinitely many functions fit the training data
• The training set {(x_i, y_i)} is finite, so infinitely many f fit it exactly
• Constraints must come from considerations outside of the data
• The strength of the constraint (the complexity restriction) can be viewed as a neighborhood size
• The nature of the constraint also comes from the metric used to define the neighborhoods
• In particular, overcoming the curse of dimensionality requires neighborhoods that are not isotropic
2.7.1 Difficulty of the Problem
23. 23
• Variety of nonparametric regression techniques
• Add roughness penalty (regularization) term to RSS
• PRSS(f; λ) = RSS(f) + λ J(f)
• Penalty functional 𝐽 can be used to impose special structure
• Additive models with smooth coordinate (feature) functions
• model f(X) = Σ_{j=1}^{p} f_j(X_j), with penalty J(f) = Σ_{j=1}^{p} J(f_j)
• Projection pursuit regression
• f(X) = Σ_{m=1}^{M} g_m(α_m^T X)
• For more on penalty, see Ch.5
• For Bayesian approach, see Ch.8
2.8.1 Roughness Penalty and Bayesian methods
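As one minimal concrete instance of the PRSS idea (a quadratic penalty on the coefficients of a linear model, i.e. a ridge-style penalty, not the smoothing-spline roughness penalty of Ch.5), the penalized criterion still has a closed-form minimizer; everything below is an illustrative sketch.

```python
import numpy as np

def penalized_ls(X, y, lam):
    """Minimize ||y - X beta||^2 + lam * ||beta||^2 (quadratic penalty J on the coefficients)."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=50)

for lam in (0.0, 1.0, 100.0):
    beta = penalized_ls(X, y, lam)
    print(lam, round(np.linalg.norm(beta), 3))   # larger lambda shrinks the fit harder
```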
24. 24
• Kernel methods specify the nature of local neighborhood
• The local neighborhood is specified by a kernel function
• Gaussian kernel: K_λ(x_0, x) = (1/λ) exp(−||x − x_0||² / (2λ))
• In general, a local regression estimate is f_θ̂(x_0), where
• θ̂ = argmin_θ RSS(f_θ, x_0) = argmin_θ Σ_{i=1}^{N} K_λ(x_0, x_i) (y_i − f_θ(x_i))²
• For more on this, see Ch.6
2.8.2 Kernel Methods and Local Regression
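A minimal sketch of the simplest case, where f_θ is a local constant, so the kernel-weighted criterion above reduces to a kernel-weighted average of the responses (a Nadaraya-Watson-style estimate); the kernel bandwidth and data are illustrative choices.

```python
import numpy as np

def gaussian_kernel(x0, x, lam):
    """K_lambda(x0, x) = (1/lambda) * exp(-||x - x0||^2 / (2*lambda))."""
    return np.exp(-np.sum((x - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

def local_constant_fit(x0, X_train, y_train, lam):
    """Minimizing sum_i K(x0, x_i) * (y_i - theta)^2 over a constant theta
    gives the kernel-weighted average of the responses."""
    w = gaussian_kernel(x0, X_train, lam)
    return np.sum(w * y_train) / np.sum(w)

rng = np.random.default_rng(0)
X_train = rng.uniform(-2, 2, size=(300, 1))
y_train = np.sin(X_train[:, 0]) + 0.2 * rng.normal(size=300)

print(local_constant_fit(np.array([1.0]), X_train, y_train, lam=0.05))  # ~ sin(1) = 0.84
```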
25. 25
• This class includes a wide variety of methods
1. The model for f is a linear expansion of basis functions h_m(x)
• f_θ(x) = Σ_{m=1}^{M} θ_m h_m(x)
• For more, see Sec.5.2, Ch.9
2. Radial basis functions are symmetric 𝑝-dimensional kernels
• f_θ(x) = Σ_{m=1}^{M} K_{λ_m}(μ_m, x) θ_m
• For more, see Sec.6.7
3. Feed-forward neural network (single layer)
• f_θ(x) = Σ_{m=1}^{M} β_m σ(α_m^T x + b_m), where σ is the sigmoid function
• For more, see Ch.11
• Dictionary methods choose the basis functions adaptively from a large candidate set (a dictionary)
2.8.3 Basis Functions and Dictionary methods
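A short sketch of fitting one such expansion by least squares, using Gaussian radial basis functions with fixed, hand-chosen centers μ_m and a single shared width λ; the centers, width, and data are all illustrative simplifications (item 2 above allows a separate λ_m per basis function).

```python
import numpy as np

def rbf_design(x, centers, lam):
    """Columns h_m(x) = K_lambda(mu_m, x) = exp(-(x - mu_m)^2 / (2*lam))."""
    return np.exp(-(x[:, None] - centers[None, :]) ** 2 / (2.0 * lam))

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3, size=200)
y = np.sin(x) + 0.2 * rng.normal(size=200)

centers = np.linspace(-3, 3, 10)               # mu_1, ..., mu_M (fixed here, M = 10)
H = rbf_design(x, centers, lam=0.5)            # N x M basis matrix
theta = np.linalg.lstsq(H, y, rcond=None)[0]   # least squares: the fit is linear in theta

x_new = np.array([0.0, 1.5])
print(rbf_design(x_new, centers, lam=0.5) @ theta)   # ~ sin(0) and sin(1.5)
```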
26. 26
• Many models have a smoothing or complexity parameter
• It cannot be chosen using the residual sum-of-squares on the training data
• The residuals would be driven to zero and the model would overfit
• The expected prediction error at 𝑥0 (test, generalization error)
• EPE_k(x_0) = E[(Y − f̂_k(x_0))² | X = x_0]
= σ² + Bias²(f̂_k(x_0)) + Var_T(f̂_k(x_0))
= σ² + [f(x_0) − (1/k) Σ_{l=1}^{k} f(x_(l))]² + σ²/k
= T1 + T2 + T3 (simulated below)
• 𝑇1: irreducible error, beyond our control
• 𝑇2: (Squared) Bias term of mean squared error
• 𝑇2 increases with 𝑘
• 𝑇3: Variance term of mean squared error
• 𝑇3 decreases with 𝑘
2.9 Model Selection and the Bias-Variance Tradeoff
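A small simulation of this decomposition for k-NN regression at a fixed x_0, averaging over many training sets T drawn from an illustrative model Y = f(X) + ε (the true f, σ, N, and x_0 below are arbitrary choices); it shows T2 rising and T3 falling as k grows.

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(4 * x)            # illustrative true regression function
sigma, N, x0 = 0.3, 100, 0.5

def knn_at_x0(k):
    """One draw of the training set T, returning the k-NN estimate f_hat_k(x0)."""
    X = rng.uniform(0, 1, size=N)
    y = f(X) + sigma * rng.normal(size=N)
    neighbors = np.argsort(np.abs(X - x0))[:k]
    return y[neighbors].mean()

for k in (1, 5, 20, 50):
    estimates = np.array([knn_at_x0(k) for _ in range(2000)])
    bias_sq = (estimates.mean() - f(x0)) ** 2     # T2: squared bias, grows with k
    variance = estimates.var()                    # T3: variance, shrinks roughly as sigma^2/k
    print(k, round(bias_sq, 4), round(variance, 4))
```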
27. 27
• Model Complexity
• If model complexity increases,
• (Squared) Bias Term 𝑇2 decreases
• Variance Term 𝑇3 increases
• There is a trade-off between Bias and Variance
• The training error is not a good estimate of test error
• For more, see Ch.7.
2.9 Model Selection and the Bias-Variance Tradeoff (contd.)