C2.5

Machine Learning Specialization

Lasso Regression:
Regularization for feature selection
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
©2015 Emily Fox & Carlos Guestrin

Feature selection task

Why might you want to perform feature selection?
Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ is sparse (many zeros), computation only depends on # of non-zeros:
   $\hat{y}_i = \sum_{\hat{w}_j \neq 0} \hat{w}_j h_j(x_i)$
Interpretability:
-  Which features are relevant for prediction?
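The efficiency point can be illustrated with a minimal sketch, assuming the non-zero coefficients are stored in a dictionary keyed by feature index (the storage layout and the name `predict_sparse` are illustrative choices, not something the slides specify). The prediction then only touches the non-zero entries.

```python
def predict_sparse(weights, features):
    """Sum w_j * h_j(x) over the non-zero weights only.

    weights  -- dict {feature index j: non-zero coefficient w_j} (hypothetical storage)
    features -- sequence with features[j] = h_j(x) for one input x
    """
    return sum(w_j * features[j] for j, w_j in weights.items())


# A dense coefficient vector with many zeros...
w_dense = [0.0, 2.5, 0.0, 0.0, -1.0, 0.0]
# ...stored sparsely: only 2 of the 6 entries are kept.
w_sparse = {j: w for j, w in enumerate(w_dense) if w != 0.0}

h_x = [1.0, 3.0, 7.0, 2.0, 4.0, 9.0]
y_hat = predict_sparse(w_sparse, h_x)  # 2.5 * 3.0 + (-1.0) * 4.0
```

The cost of one prediction is proportional to `len(w_sparse)` rather than to the total number of features.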

Sparsity: Housing application

Predicting the sale price ($ ?) of a house from many candidate features:
Lot size, Single family, Year built, Last sold price, Last sale price/sqft, Finished sqft, Unfinished sqft, Finished basement sqft, # floors, Flooring types, Parking type, Parking amount, Cooling, Heating, Exterior materials, Roof type, Structure style, Dishwasher, Garbage disposal, Microwave, Range / Oven, Refrigerator, Washer, Dryer, Laundry location, Heating type, Jetted tub, Deck, Fenced yard, Lawn, Garden, Sprinkler system, …

Sparsity: Reading your mind

Activity in which brain regions can predict happiness (from very sad to very happy)?

Option 1: All subsets

Find best model of size: 0, 1, 2, …, 8

[Figure: RSS(ŵ) plotted against # of features, built up one model size at a time from 0 to 8.]

Candidate features:
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront

Note: Not necessarily nested!

Choosing model complexity?
Option 1: Assess on validation set
Option 2: Cross validation
Option 3+: Other metrics for penalizing
model complexity like BIC…

Complexity of “all subsets”
How many models were evaluated?
-  each indexed by features included
$y_i = \epsilon_i$  →  [0 0 0 … 0 0 0]
$y_i = w_0 h_0(x_i) + \epsilon_i$  →  [1 0 0 … 0 0 0]
$y_i = w_1 h_1(x_i) + \epsilon_i$  →  [0 1 0 … 0 0 0]
$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \epsilon_i$  →  [1 1 0 … 0 0 0]
…
$y_i = w_0 h_0(x_i) + w_1 h_1(x_i) + \dots + w_D h_D(x_i) + \epsilon_i$  →  [1 1 1 … 1 1 1]

Number of models: $2^D$
-  $2^8 = 256$
-  $2^{30} = 1,073,741,824$
-  $2^{1000} \approx 1.071509 \times 10^{301}$
-  $2^{100B}$ = HUGE!!!!!!
Typically,
computationally
infeasible
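
As a rough sketch of the all-subsets search described above, the following enumerates every feature subset with `itertools.combinations` and keeps the lowest-RSS model of each size. The synthetic data, the helper names `rss` and `all_subsets`, and the use of NumPy's least squares solver are assumptions made for illustration.

```python
from itertools import combinations

import numpy as np


def rss(H, y, cols):
    """RSS of the least squares fit of y on the feature columns in cols."""
    if not cols:                       # model with no features predicts 0
        return float(y @ y)
    Hs = H[:, list(cols)]
    w, *_ = np.linalg.lstsq(Hs, y, rcond=None)
    resid = y - Hs @ w
    return float(resid @ resid)


def all_subsets(H, y):
    """Return {size: (best_rss, best_columns)} and the number of models fit."""
    D = H.shape[1]
    best, n_models = {}, 0
    for size in range(D + 1):
        for cols in combinations(range(D), size):
            n_models += 1
            err = rss(H, y, cols)
            if size not in best or err < best[size][0]:
                best[size] = (err, cols)
    return best, n_models


# Synthetic example: only features 1 and 3 carry signal.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 4))
y = 3.0 * H[:, 1] - 2.0 * H[:, 3] + 0.1 * rng.normal(size=50)
best, n_models = all_subsets(H, y)
```

With 4 candidate features the loop fits 2^4 = 16 models; the same loop over 30 features would already need over a billion fits.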

Option 2: Greedy algorithms

Forward stepwise algorithm
1.  Pick a dictionary of features {h0(x),…,hD(x)}
- e.g., polynomials for linear regression
2.  Greedy heuristic:
i.  Start with empty set of features F0 = ∅
(or simple set, like just h0(x)=1 → yi = w0 + εi)
ii.  Fit model using current feature set Ft to get ŵ(t)
iii.  Select next best feature hj*(x)
-  e.g., hj(x) resulting in lowest training error
when learning with Ft + {hj(x)}
iv.  Set Ft+1 ← Ft + {hj*(x)}
v.  Recurse
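
The greedy heuristic can be sketched as follows, assuming "next best feature" means the one giving the lowest training RSS, as in the example on the slide. The function names and the synthetic data are hypothetical; the stopping rule here is just a cap on the number of features.

```python
import numpy as np


def fit_rss(H, y, cols):
    """RSS of the least squares fit using only the feature columns in cols."""
    if not cols:
        return float(y @ y)
    w, *_ = np.linalg.lstsq(H[:, cols], y, rcond=None)
    resid = y - H[:, cols] @ w
    return float(resid @ resid)


def forward_stepwise(H, y, max_features):
    """Greedily add the feature that lowers training RSS the most."""
    selected, path = [], []
    remaining = list(range(H.shape[1]))
    while remaining and len(selected) < max_features:
        # Try each unused feature together with the current set F_t.
        errors = {j: fit_rss(H, y, selected + [j]) for j in remaining}
        j_star = min(errors, key=errors.get)
        selected.append(j_star)          # F_{t+1} = F_t + {h_j*}
        remaining.remove(j_star)
        path.append((list(selected), errors[j_star]))
    return path


# Synthetic example: only features 1 and 3 carry signal.
rng = np.random.default_rng(0)
H = rng.normal(size=(50, 4))
y = 3.0 * H[:, 1] - 2.0 * H[:, 3] + 0.1 * rng.normal(size=50)
path = forward_stepwise(H, y, max_features=4)
```

Each entry of `path` holds the selected feature set after a step and its training RSS, which never increases from one step to the next.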

Visualizing greedy procedure
[Figure: RSS(ŵ) plotted against # of features (0 to 8) as forward stepwise adds one feature at a time from: # bedrooms, # bathrooms, sq.ft. living, sq.ft. lot, floors, year built, year renovated, waterfront.]

Training error never increases, and the solutions eventually meet (at the full model).

When do we stop?
When training error is low enough? No!
When test error is low enough? No!
Use validation set or cross validation!

Complexity of forward stepwise
How many models were evaluated?
-  1st step, D models
-  2nd step, D-1 models (add 1 feature out of D-1 possible)
-  3rd step, D-2 models (add 1 feature out of D-2 possible)
-  …
How many steps?
-  Depends
-  At most D steps (to full model)
$O(D^2) \ll 2^D$
for large D
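
As a short worked check of this bound, assuming the algorithm runs all the way to the full model, the per-step counts add up as:

```latex
\sum_{t=0}^{D-1} (D - t) = D + (D-1) + \dots + 1 = \frac{D(D+1)}{2} = O(D^2)
```

For the D = 8 housing features this is 36 model fits, compared with 2^8 = 256 for all subsets.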

Other greedy algorithms
Instead of starting from simple model
and always growing…
Backward stepwise:
Start with full model and iteratively remove
features least useful to fit
Combining forward and backward steps:
In forward algorithm, insert steps to remove
features no longer as important
Lots of other variants, too.

Option 3: Regularize

Ridge regression:
L2 regularized regression
Total cost = measure of fit + λ · measure of magnitude of coefficients
           = $RSS(w) + \lambda \|w\|_2^2$, where $\|w\|_2^2 = w_0^2 + \dots + w_D^2$

Encourages small weights but not exactly 0

Coefficient path – ridge

[Figure: coefficients ŵj plotted against λ (0 to 200000) for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]

Recall sparsity (many ŵj=0) gives efficiency and interpretability
Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ is sparse (many zeros), computation only depends on # of non-zeros:
   $\hat{y}_i = \sum_{\hat{w}_j \neq 0} \hat{w}_j h_j(x_i)$
Interpretability:
-  Which features are relevant for prediction?

Using regularization for
feature selection
Instead of searching over a discrete set of
solutions, can we use regularization?
- Start with full model (all possible features)
- “Shrink” some coefficients exactly to 0
•  i.e., knock out certain features
- Non-zero coefficients indicate “selected” features

Thresholding ridge coefficients?
Why don’t we just set small ridge coefficients to 0?

[Figure: ridge coefficients by feature, with a threshold value around 0 determining the selected features.]

Selected features for a given threshold value

Let’s look at two related features…
Nothing measuring bathrooms was included!
(Ridge spreads weight across strongly related features, so each of their coefficients can fall below the threshold.)

If only one of the features had been included…
Would have included bathrooms in selected model

Can regularization lead directly to sparsity?

Try this cost instead of ridge…
Total cost = measure of fit + λ · measure of magnitude of coefficients
           = $RSS(w) + \lambda \|w\|_1$, where $\|w\|_1 = |w_0| + \dots + |w_D|$

Lasso regression (a.k.a. L1 regularized regression)
Leads to sparse solutions!

Lasso regression:
L1 regularized regression
Just like ridge regression, solution is governed by a continuous parameter λ:
$RSS(w) + \lambda \|w\|_1$
tuning parameter = balance of fit and sparsity
If λ=0: ŵ_lasso = ŵ_LS (the unregularized least squares solution)
If λ=∞: ŵ_lasso = 0
If λ in between: $0 \le \|\hat{w}_{lasso}\|_1 \le \|\hat{w}_{LS}\|_1$

Coefficient path – ridge

[Figure: coefficients ŵj plotted against λ (0 to 200000) for bedrooms, bathrooms, sqft_living, sqft_lot, floors, yr_built, yr_renovat, waterfront.]

Coefficient path – lasso

[Figure: coefficients ŵj plotted against λ (0 to 200000) for the same eight features.]

Geometric intuition for
sparsity of lasso solution

Geometric intuition for
ridge regression

Visualizing the ridge cost in 2D
$RSS(w) + \lambda \|w\|_2^2 = \sum_{i=1}^N (y_i - w_0 h_0(x_i) - w_1 h_1(x_i))^2 + \lambda (w_0^2 + w_1^2)$

[Figures: contour plots over (w0, w1), each axis from −10 to 10, of the RSS cost and of the L2 penalty.]

Visualizing the ridge solution
$\sum_{i=1}^N (y_i - w_0 h_0(x_i) - w_1 h_1(x_i))^2 + \lambda (w_0^2 + w_1^2)$

[Figure: level sets over (w0, w1), each axis from −10 to 10, with contour labels 5215 and 4.75, annotated “level sets intersect”.]

Geometric intuition for lasso

Visualizing the lasso cost in 2D
$RSS(w) + \lambda \|w\|_1 = \sum_{i=1}^N (y_i - w_0 h_0(x_i) - w_1 h_1(x_i))^2 + \lambda (|w_0| + |w_1|)$

[Figures: contour plots over (w0, w1), each axis from −10 to 10, of the RSS cost and of the L1 penalty.]

Visualizing the lasso solution
$\sum_{i=1}^N (y_i - w_0 h_0(x_i) - w_1 h_1(x_i))^2 + \lambda (|w_0| + |w_1|)$

[Figure: level sets over (w0, w1), each axis from −10 to 10, with contour labels 5215 and 2.75, annotated “level sets intersect”.]

Revisit polynomial fit demo
What happens if we refit our high-order
polynomial, but now using lasso regression?
Will consider a few settings of λ …

Fitting the lasso regression model
(for given λ value)

[Diagram: training data → feature extraction h(x) → ML model → ŷ; the quality metric compares ŷ with y, and the ML algorithm produces ŵ.]

How we optimized past objectives
To solve for ŵ, previously took gradient
of total cost objective and either:
1)  Derived closed-form solution
2)  Used in gradient descent algorithm

Optimizing the lasso objective
Lasso total cost: $RSS(w) + \lambda \|w\|_1$
Issues:
1)  What’s the derivative of |wj|?
    gradients → subgradients
2)  Even if we could compute derivative, no closed-form solution
    can use subgradient descent

Aside 1: Coordinate descent

Coordinate descent
Goal: Minimize some function g
Often, hard to find minimum for all coordinates, but easy for each coordinate
Coordinate descent:
Initialize ŵ = 0 (or smartly…)
while not converged
    pick a coordinate j
    ŵj ← $\arg\min_{\omega} g(\hat{w}_0, \dots, \hat{w}_{j-1}, \omega, \hat{w}_{j+1}, \dots, \hat{w}_D)$
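
A minimal sketch of this loop, assuming each coordinate has a closed-form minimizer and the coordinates are visited round robin. The quadratic `g` and the helper name `coordinate_descent` are illustrative choices, not taken from the slides.

```python
def coordinate_descent(minimizers, w, tol=1e-10, max_sweeps=1000):
    """minimizers[j](w) returns the w[j] that minimizes g with the other coordinates fixed."""
    for _ in range(max_sweeps):
        max_step = 0.0
        for j, argmin_j in enumerate(minimizers):   # round-robin order
            new_wj = argmin_j(w)
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:                          # steps have become tiny
            break
    return w


# g(w0, w1) = (w0 - 1)^2 + (w1 + 2)^2 + w0*w1, an illustrative convex quadratic
minimizers = [
    lambda w: 1.0 - w[1] / 2.0,    # solves dg/dw0 = 2(w0 - 1) + w1 = 0
    lambda w: -2.0 - w[0] / 2.0,   # solves dg/dw1 = 2(w1 + 2) + w0 = 0
]
w_hat = coordinate_descent(minimizers, [0.0, 0.0])
```

Setting both partial derivatives to zero by hand gives the minimum at (8/3, −10/3), which the loop approaches without any step size.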

Comments on coordinate descent
How do we pick next coordinate?
-  At random (“random” or “stochastic” coordinate descent),
round robin, …
No stepsize to choose!
Super useful approach for many problems
-  Converges to optimum in some cases
(e.g., “strongly convex”)
-  Converges for lasso objective

Aside 2: Normalizing features

Normalizing features
Scale training columns (not rows!) as:
$\underline{h}_j(x_k) = \frac{h_j(x_k)}{\sqrt{\sum_{i=1}^N h_j(x_i)^2}}$  (x_k a training point; summing over training points)
Apply same training scale factors to test data:
$\underline{h}_j(x_k) = \frac{h_j(x_k)}{\sqrt{\sum_{i=1}^N h_j(x_i)^2}}$  (apply to test point x_k; still summing over training points)

[Diagram: training features and test features, each column j scaled by the same normalizer zj, which is computed from the training features.]
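
A small sketch of this normalization, assuming the features are held in a NumPy array with one row per data point and one column per feature (the array layout and the name `normalize_features` are assumptions). It does not guard against an all-zero column, which would divide by zero.

```python
import numpy as np


def normalize_features(H_train):
    """Scale each column to unit 2-norm; return the scaled matrix and the norms."""
    norms = np.sqrt((H_train ** 2).sum(axis=0))   # one normalizer per column
    return H_train / norms, norms


H_train = np.array([[3.0, 10.0],
                    [4.0, 20.0],
                    [0.0, 20.0]])
H_test = np.array([[1.0, 5.0]])

H_train_norm, norms = normalize_features(H_train)
H_test_norm = H_test / norms      # reuse the training normalizers on test data
```

Here the column norms are 5 and 30, so the test point (1, 5) maps to (0.2, 1/6).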

Aside 3: Coordinate descent for
unregularized regression
(for normalized features)

Optimizing least squares objective
one coordinate at a time
Fix all coordinates w-j and take partial w.r.t. wj
$RSS(w) = \sum_{i=1}^N \left( y_i - \sum_{j=0}^D w_j h_j(x_i) \right)^2$

$\frac{\partial}{\partial w_j} RSS(w) = -2 \sum_{i=1}^N h_j(x_i) \left( y_i - \sum_{k=0}^D w_k h_k(x_i) \right)$

Optimizing least squares objective
Set partial = 0 and solve

Pull the $w_j$ term out of the inner sum and use $\sum_{i=1}^N h_j(x_i)^2 = 1$ (normalized features):
$\frac{\partial}{\partial w_j} RSS(w) = -2 \rho_j + 2 w_j = 0 \;\Rightarrow\; \hat{w}_j = \rho_j$
where $\rho_j = \sum_{i=1}^N h_j(x_i) \left( y_i - \sum_{k \neq j} w_k h_k(x_i) \right)$

Coordinate descent for
least squares regression
while not converged
    for j=0,1,…,D
        compute: $\rho_j = \sum_{i=1}^N h_j(x_i) (y_i - \hat{y}_i(\hat{w}_{-j}))$
        set: ŵj = ρj
Here $\hat{y}_i(\hat{w}_{-j})$ is the prediction without feature j, so $y_i - \hat{y}_i(\hat{w}_{-j})$ is the residual without feature j.
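
The loop above might look as follows in NumPy. This is a sketch under two assumptions: the columns of `H` already have unit 2-norm, and a fixed number of sweeps stands in for a convergence test. The function name and the synthetic data are hypothetical.

```python
import numpy as np


def ls_coordinate_descent(H, y, n_sweeps=200):
    """Least squares by coordinate descent; columns of H must have unit 2-norm."""
    w = np.zeros(H.shape[1])
    for _ in range(n_sweeps):
        for j in range(H.shape[1]):
            # residual without feature j: add back feature j's contribution
            resid_without_j = y - H @ w + H[:, j] * w[j]
            rho_j = H[:, j] @ resid_without_j
            w[j] = rho_j
    return w


rng = np.random.default_rng(1)
H = rng.normal(size=(40, 3))
H = H / np.sqrt((H ** 2).sum(axis=0))      # normalize columns
y = H @ np.array([2.0, -1.0, 0.5]) + 0.01 * rng.normal(size=40)

w_cd = ls_coordinate_descent(H, y)
w_ls, *_ = np.linalg.lstsq(H, y, rcond=None)   # reference solution
```

On this small problem the coordinate-wise updates should agree with the direct least squares solution `w_ls` to within numerical tolerance.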

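Coordinate descent for lasso (for normalized features)
while not converged
    for j=0,1,…,D
        compute: $\rho_j = \sum_{i=1}^N h_j(x_i) (y_i - \hat{y}_i(\hat{w}_{-j}))$
        set:
$$\hat{w}_j = \begin{cases} \rho_j + \lambda/2 & \text{if } \rho_j < -\lambda/2 \\ 0 & \text{if } \rho_j \in [-\lambda/2, \lambda/2] \\ \rho_j - \lambda/2 & \text{if } \rho_j > \lambda/2 \end{cases}$$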

Soft thresholding
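$$\hat{w}_j = \begin{cases} \rho_j + \lambda/2 & \text{if } \rho_j < -\lambda/2 \\ 0 & \text{if } \rho_j \in [-\lambda/2, \lambda/2] \\ \rho_j - \lambda/2 & \text{if } \rho_j > \lambda/2 \end{cases}$$

[Figure: ŵj plotted against ρj.]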
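
The soft-thresholding rule for normalized features can be written as a small function; the name `soft_threshold` and the argument names are illustrative.

```python
def soft_threshold(rho_j, lam):
    """Lasso coordinate update for a normalized feature."""
    if rho_j < -lam / 2.0:
        return rho_j + lam / 2.0     # shrink toward 0 from below
    if rho_j > lam / 2.0:
        return rho_j - lam / 2.0     # shrink toward 0 from above
    return 0.0                       # |rho_j| <= lam/2: coefficient set exactly to 0
```

For example, with λ = 2 an input of 3 is shrunk to 2, an input of −3 to −2, and any input in [−1, 1] becomes exactly 0.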

How to assess convergence?
(What should “while not converged” test in the coordinate descent loop?)

Convergence criteria
When to stop?
For convex problems, will start to take smaller and smaller steps
Measure size of steps taken in a full loop over all features
-  stop when max step < ε
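
Putting the update and the stopping rule together, a sketch of lasso coordinate descent for normalized features could look like this. The tolerance `tol` plays the role of ε; the function name, the cap on sweeps, and the synthetic data are assumptions.

```python
import numpy as np


def lasso_coordinate_descent(H, y, lam, tol=1e-8, max_sweeps=10_000):
    """Minimize RSS(w) + lam * ||w||_1; columns of H must have unit 2-norm."""
    w = np.zeros(H.shape[1])
    for _ in range(max_sweeps):
        max_step = 0.0
        for j in range(H.shape[1]):
            rho_j = H[:, j] @ (y - H @ w + H[:, j] * w[j])
            if rho_j < -lam / 2:
                new_wj = rho_j + lam / 2
            elif rho_j > lam / 2:
                new_wj = rho_j - lam / 2
            else:
                new_wj = 0.0
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:        # largest step in a full loop is below epsilon
            break
    return w


rng = np.random.default_rng(3)
H = rng.normal(size=(60, 5))
H = H / np.sqrt((H ** 2).sum(axis=0))
y = H @ np.array([5.0, 0.0, 0.0, -3.0, 0.0]) + 0.01 * rng.normal(size=60)

w_no_penalty = lasso_coordinate_descent(H, y, lam=0.0)
w_lasso = lasso_coordinate_descent(H, y, lam=1.0)
lam_big = 2 * np.abs(H.T @ y).max() + 1.0
w_all_zero = lasso_coordinate_descent(H, y, lam=lam_big)
```

With λ = 0 the result should match ordinary least squares, and once λ/2 exceeds every |ρj| at ŵ = 0 all coefficients stay at zero.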

Other lasso solvers
Classically: Least angle regression (LARS) [Efron et al. ’04]
Then: Coordinate descent algorithm [Fu ’98, Friedman, Hastie, & Tibshirani ’08]
Now:
•  Parallel CD (e.g., Shotgun, [Bradley et al. ’11])
•  Other parallel learning approaches for linear models
   -  Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
   -  Parallel independent solutions then averaging [Zhang et al. ’12]
•  Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]

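Coordinate descent for lasso with normalized features
while not converged
    for j=0,1,…,D
        compute: $\rho_j = \sum_{i=1}^N h_j(x_i) (y_i - \hat{y}_i(\hat{w}_{-j}))$
        set:
$$\hat{w}_j = \begin{cases} \rho_j + \lambda/2 & \text{if } \rho_j < -\lambda/2 \\ 0 & \text{if } \rho_j \in [-\lambda/2, \lambda/2] \\ \rho_j - \lambda/2 & \text{if } \rho_j > \lambda/2 \end{cases}$$

Coordinate descent for lasso with unnormalized features
Precompute: $z_j = \sum_{i=1}^N h_j(x_i)^2$
while not converged
    for j=0,1,…,D
        compute: $\rho_j = \sum_{i=1}^N h_j(x_i) (y_i - \hat{y}_i(\hat{w}_{-j}))$
        set:
$$\hat{w}_j = \begin{cases} (\rho_j + \lambda/2)/z_j & \text{if } \rho_j < -\lambda/2 \\ 0 & \text{if } \rho_j \in [-\lambda/2, \lambda/2] \\ (\rho_j - \lambda/2)/z_j & \text{if } \rho_j > \lambda/2 \end{cases}$$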
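
The unnormalized variant differs only in precomputing zj and dividing by it. A sketch, with a hypothetical function name and synthetic features on deliberately different scales:

```python
import numpy as np


def lasso_cd_unnormalized(H, y, lam, tol=1e-8, max_sweeps=10_000):
    """Minimize RSS(w) + lam * ||w||_1 without normalizing the columns of H."""
    z = (H ** 2).sum(axis=0)             # precompute z_j = sum_i h_j(x_i)^2
    w = np.zeros(H.shape[1])
    for _ in range(max_sweeps):
        max_step = 0.0
        for j in range(H.shape[1]):
            rho_j = H[:, j] @ (y - H @ w + H[:, j] * w[j])
            if rho_j < -lam / 2:
                new_wj = (rho_j + lam / 2) / z[j]
            elif rho_j > lam / 2:
                new_wj = (rho_j - lam / 2) / z[j]
            else:
                new_wj = 0.0
            max_step = max(max_step, abs(new_wj - w[j]))
            w[j] = new_wj
        if max_step < tol:
            break
    return w


rng = np.random.default_rng(4)
H = rng.normal(size=(50, 3)) * np.array([1.0, 10.0, 100.0])   # mixed scales
y = H @ np.array([2.0, -0.3, 0.05]) + 0.1 * rng.normal(size=50)

w_unpenalized = lasso_cd_unnormalized(H, y, lam=0.0)
lam_big = 2 * np.abs(H.T @ y).max() + 1.0
w_zero = lasso_cd_unnormalized(H, y, lam=lam_big)
```

Note that the threshold test is still on ρj against ±λ/2; only the size of the non-zero update is rescaled by zj.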

Optimizing lasso objective
Fix all coordinates w-j and take partial w.r.t. wj
$RSS(w) + \lambda \|w\|_1 = \sum_{i=1}^N \left( y_i - \sum_{j=0}^D w_j h_j(x_i) \right)^2 + \lambda \sum_{j=0}^D |w_j|$
(derive without normalizing features)

Part 1: Partial of RSS term
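$\frac{\partial}{\partial w_j} RSS(w) = -2 \sum_{i=1}^N h_j(x_i) \left( y_i - \sum_{k=0}^D w_k h_k(x_i) \right)$
$= -2 \sum_{i=1}^N h_j(x_i) \left( y_i - \sum_{k \neq j} w_k h_k(x_i) \right) + 2 w_j \sum_{i=1}^N h_j(x_i)^2$
$= -2 \rho_j + 2 z_j w_j$
with $\rho_j = \sum_{i=1}^N h_j(x_i) \left( y_i - \sum_{k \neq j} w_k h_k(x_i) \right)$ and $z_j = \sum_{i=1}^N h_j(x_i)^2$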

Part 2: Partial of L1 penalty term
$\lambda \frac{\partial}{\partial w_j} |w_j| = \; ???$

[Figure: |x| plotted against x; the derivative is not defined at x = 0.]

Subgradients of convex functions
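Gradients lower bound convex functions: $g(b) \ge g(a) + \nabla g(a) (b - a)$
-  gradient is unique at x if function differentiable at x
Subgradients: Generalize gradients to non-differentiable points:
-  Any plane that lower bounds function: $V$ is a subgradient of $g$ at $x$ if $g(b) \ge g(x) + V (b - x)$ for all $b$

[Figures: a convex function g(x) with points a and b marked; the function |x| plotted against x.]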

Part 2: Subgradient of L1 term
$$\lambda \, \partial_{w_j} |w_j| = \begin{cases} -\lambda & \text{when } w_j < 0 \\ [-\lambda, \lambda] & \text{when } w_j = 0 \\ \lambda & \text{when } w_j > 0 \end{cases}$$

[Figure: |wj| plotted against wj.]

Putting it all together…
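$$\partial_{w_j} [\text{lasso cost}] = 2 z_j w_j - 2 \rho_j + \begin{cases} -\lambda & \text{when } w_j < 0 \\ [-\lambda, \lambda] & \text{when } w_j = 0 \\ \lambda & \text{when } w_j > 0 \end{cases}$$
$$= \begin{cases} 2 z_j w_j - 2 \rho_j - \lambda & \text{when } w_j < 0 \\ [-2\rho_j - \lambda, \; -2\rho_j + \lambda] & \text{when } w_j = 0 \\ 2 z_j w_j - 2 \rho_j + \lambda & \text{when } w_j > 0 \end{cases}$$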

Optimal solution:
Set subgradient = 0
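$$\partial_{w_j} [\text{lasso cost}] = \begin{cases} 2 z_j w_j - 2 \rho_j - \lambda & \text{when } w_j < 0 \\ [-2\rho_j - \lambda, \; -2\rho_j + \lambda] & \text{when } w_j = 0 \\ 2 z_j w_j - 2 \rho_j + \lambda & \text{when } w_j > 0 \end{cases} \; = 0$$

Case 1 (wj < 0): $2 z_j \hat{w}_j - 2 \rho_j - \lambda = 0 \;\Rightarrow\; \hat{w}_j = (\rho_j + \lambda/2)/z_j$
    For ŵj < 0, need ρj < −λ/2
Case 2 (wj = 0): ŵj = 0
    For ŵj = 0, need [-2ρj-λ, -2ρj+λ] to contain 0: −λ/2 ≤ ρj ≤ λ/2
Case 3 (wj > 0): $2 z_j \hat{w}_j - 2 \rho_j + \lambda = 0 \;\Rightarrow\; \hat{w}_j = (\rho_j - \lambda/2)/z_j$
    For ŵj > 0, need ρj > λ/2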

Optimal solution:
Set subgradient = 0

$$\hat{w}_j = \begin{cases} (\rho_j + \lambda/2)/z_j & \text{if } \rho_j < -\lambda/2 \\ 0 & \text{if } \rho_j \in [-\lambda/2, \lambda/2] \\ (\rho_j - \lambda/2)/z_j & \text{if } \rho_j > \lambda/2 \end{cases}$$

Soft thresholding
[Figure: ŵj plotted against ρj.]

Coordinate descent for lasso (unnormalized features)
Precompute: $z_j = \sum_{i=1}^N h_j(x_i)^2$
while not converged
    for j=0,1,…,D
        compute: $\rho_j = \sum_{i=1}^N h_j(x_i) (y_i - \hat{y}_i(\hat{w}_{-j}))$
        set: ŵj by the soft-thresholding rule above

If sufficient amount of data…

[Diagram: data split into training set, validation set, and test set.]
-  Training set: fit ŵλ
-  Validation set: test performance of ŵλ to select λ*
-  Test set: assess generalization error of ŵλ*

K-fold cross validation
For k=1,…,K
1.  Estimate ŵλ(k) on the training blocks
2.  Compute error on validation block: errork(λ)
Compute average error: $CV(\lambda) = \frac{1}{K} \sum_{k=1}^K error_k(\lambda)$

[Diagram: data divided into K blocks; with block 5 as the validation set, the remaining blocks give ŵλ(5) and error5(λ).]
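
A sketch of the K-fold procedure for choosing λ, assuming contiguous validation blocks and validation RSS as the error. The compact `lasso_cd` fit, the candidate λ grid, and the synthetic data are illustrative assumptions; in practice the rows are often shuffled before splitting.

```python
import numpy as np


def lasso_cd(H, y, lam, n_sweeps=200):
    """Compact lasso coordinate descent (unnormalized features, fixed sweeps)."""
    z = (H ** 2).sum(axis=0)
    w = np.zeros(H.shape[1])
    for _ in range(n_sweeps):
        for j in range(H.shape[1]):
            rho = H[:, j] @ (y - H @ w + H[:, j] * w[j])
            # soft thresholding, written in one line
            w[j] = np.sign(rho) * max(abs(rho) - lam / 2, 0.0) / z[j]
    return w


def k_fold_cv(H, y, lam, K, fit=lasso_cd):
    """Average validation RSS over K contiguous blocks."""
    folds = np.array_split(np.arange(len(y)), K)
    errors = []
    for k in range(K):
        valid = folds[k]
        train = np.concatenate([folds[i] for i in range(K) if i != k])
        w = fit(H[train], y[train], lam)          # estimate on training blocks
        resid = y[valid] - H[valid] @ w           # error on validation block
        errors.append(float(resid @ resid))
    return sum(errors) / K


rng = np.random.default_rng(2)
H = rng.normal(size=(100, 6))
y = H @ np.array([4.0, 0.0, 0.0, -2.0, 0.0, 0.0]) + rng.normal(size=100)

lambdas = [0.0, 1.0, 10.0, 100.0, 1000.0]
cv_errors = {lam: k_fold_cv(H, y, lam, K=5) for lam in lambdas}
best_lam = min(cv_errors, key=cv_errors.get)
```

`best_lam` is the candidate with the lowest average validation error; a final fit on all training data with that λ would then be assessed on a separate test set.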

Choosing λ via cross validation
Cross validation is choosing the λ that
provides best predictive accuracy
Tends to favor less sparse solutions, and
thus smaller λ, than optimal choice for
feature selection
c.f., “Machine Learning: A Probabilistic Perspective”,
Murphy, 2012 for further discussion

Debiasing lasso
Lasso shrinks coefficients relative to LS solution
→ more bias, less variance
Can reduce bias as follows:
1.  Run lasso to select features
2.  Run least squares regression with only selected features
“Relevant” features no longer shrunk relative to LS fit of same reduced model

[Figure used with permission of Mario Figueiredo (captions modified to fit course). Panels:]
-  True coefficients (D=4096, non-zero = 160)
-  L1 reconstruction (non-zero = 1024, MSE = 0.0072)
-  Debiased (non-zero = 1024, MSE = 3.26e-005)
-  Least squares (non-zero = 0, MSE = 1.568)
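
The two-step debiasing recipe can be sketched as below. The vector `w_lasso` is a made-up stand-in for the output of a lasso solver, and the function name `debias` and the synthetic data are assumptions.

```python
import numpy as np


def debias(H, y, w_lasso):
    """Refit least squares using only the features with non-zero lasso coefficients."""
    support = np.flatnonzero(w_lasso)            # step 1: features selected by lasso
    w = np.zeros(len(w_lasso))
    if support.size:
        coef, *_ = np.linalg.lstsq(H[:, support], y, rcond=None)
        w[support] = coef                        # step 2: unshrunk LS coefficients
    return w


rng = np.random.default_rng(5)
H = rng.normal(size=(50, 5))
y = H @ np.array([3.0, 0.0, 0.0, -2.0, 0.0]) + 0.1 * rng.normal(size=50)

# Hypothetical lasso output: right features selected, coefficients shrunk.
w_lasso = np.array([2.4, 0.0, 0.0, -1.5, 0.0])
w_debiased = debias(H, y, w_lasso)
```

The sparsity pattern is unchanged, while the training RSS of `w_debiased` cannot exceed that of `w_lasso`, since least squares is optimal over the selected features.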

Issues with standard lasso objective
1.  With group of highly correlated features, lasso tends to
select amongst them arbitrarily
-  Often prefer to select all together
2.  Often, empirically ridge has better predictive performance
than lasso, but lasso leads to sparser solution
Elastic net aims to address these issues
-  hybrid between lasso and ridge regression
-  uses L1 and L2 penalties
See Zou & Hastie ‘05 for further discussion
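
For reference, an objective with both penalties can be written as below. The symbols λ1 and λ2 for the two tuning parameters are notation assumed here, not taken from the slides.

```latex
\text{Total cost} = RSS(w) + \lambda_1 \|w\|_1 + \lambda_2 \|w\|_2^2
```

Setting λ2 = 0 recovers the lasso objective and setting λ1 = 0 recovers ridge regression.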

Impact of feature selection and lasso
Lasso has changed machine learning,
statistics, & electrical engineering
But, for feature selection in general, be careful
about interpreting selected features
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under
certain conditions

What you can do now…
•  Perform feature selection using “all subsets” and “forward
stepwise” algorithms
•  Analyze computational costs of these algorithms
•  Contrast greedy and optimal algorithms
•  Formulate lasso objective
•  Describe what happens to estimated lasso coefficients as
tuning parameter λ is varied
•  Interpret lasso coefficient path plot
•  Contrast ridge and lasso regression
•  Describe geometrically why L1 penalty leads to sparsity
•  Estimate lasso regression parameters using an iterative
coordinate descent algorithm
•  Implement K-fold cross validation to select lasso tuning
parameter λ