C2.5
- 1. Machine
Learning
Specializa0on
Lasso Regression:
Regularization for feature selection
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
1 ©2015
Emily
Fox
&
Carlos
Guestrin
- 3. Machine
Learning
Specializa0on
3
Efficiency:
- If size(w) = 100B, each prediction is expensive
- If ŵ sparse , computation only depends on # of non-zeros
Interpretability:
- Which features are relevant for prediction?
Why might you want to perform
feature selection?
©2015
Emily
Fox
&
Carlos
Guestrin
3
many zeros
=
X
ˆwj 6=0
ŷi = ŵj hj(xi)
- 4. Machine
Learning
Specializa0on
4
Sparsity:
Housing application
$ ?
Lot
size
Single
Family
Year
built
Last
sold
price
Last
sale
price/sqI
Finished
sqI
Unfinished
sqI
Finished
basement
sqI
#
floors
Flooring
types
Parking
type
Parking
amount
Cooling
Hea0ng
Exterior
materials
Roof
type
Structure
style
Dishwasher
Garbage
disposal
Microwave
Range
/
Oven
Refrigerator
Washer
Dryer
Laundry
loca0on
Hea0ng
type
JeYed
Tub
Deck
Fenced
Yard
Lawn
Garden
Sprinkler
System
Lot
size
Single
Family
Year
built
Last
sold
price
Last
sale
price/sqI
Finished
sqI
Unfinished
sqI
Finished
basement
sqI
#
floors
Flooring
types
Parking
type
Parking
amount
Cooling
Hea0ng
Exterior
materials
Roof
type
Structure
style
Dishwasher
Garbage
disposal
Microwave
Range
/
Oven
Refrigerator
Washer
Dryer
Laundry
loca0on
Hea0ng
type
JeYed
Tub
Deck
Fenced
Yard
Lawn
Garden
Sprinkler
System
…
©2015
Emily
Fox
&
Carlos
Guestrin
- 5. Machine
Learning
Specializa0on
5
Sparsity:
Reading your mind
©2015
Emily
Fox
&
Carlos
Guestrin
very sad very happy
Activity in which
brain regions
can predict
happiness?
- 7. Machine
Learning
Specializa0on
7
Find best model of size: 0
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 8. Machine
Learning
Specializa0on
8
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 9. Machine
Learning
Specializa0on
9
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 10. Machine
Learning
Specializa0on
10
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 11. Machine
Learning
Specializa0on
11
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 12. Machine
Learning
Specializa0on
12
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 13. Machine
Learning
Specializa0on
13
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 14. Machine
Learning
Specializa0on
14
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 15. Machine
Learning
Specializa0on
15
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 16. Machine
Learning
Specializa0on
16
Find best model of size: 1
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 17. Machine
Learning
Specializa0on
17
Find best model of size: 2
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 18. Machine
Learning
Specializa0on
18
Note: Not necessarily nested!
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 19. Machine
Learning
Specializa0on
19
Note: Not necessarily nested!
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 20. Machine
Learning
Specializa0on
20
Find best model of size: 3
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 21. Machine
Learning
Specializa0on
21
Find best model of size: 4
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 22. Machine
Learning
Specializa0on
22
Find best model of size: 5
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 23. Machine
Learning
Specializa0on
23
Find best model of size: 6
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 24. Machine
Learning
Specializa0on
24
Find best model of size: 7
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 25. Machine
Learning
Specializa0on
25
Find best model of size: 8
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 26. Machine
Learning
Specializa0on
26
Choosing model complexity?
Option 1: Assess on validation set
Option 2: Cross validation
Option 3+: Other metrics for penalizing
model complexity like BIC…
©2015
Emily
Fox
&
Carlos
Guestrin
- 27. Machine
Learning
Specializa0on
27
Complexity of “all subsets”
How many models were evaluated?
- each indexed by features included
©2015
Emily
Fox
&
Carlos
Guestrin
yi = εi
yi = w0h0(xi) + εi
yi = w1 h1(xi) + εi
yi = w0h0(xi) + w1 h1(xi) + εi
yi = w0h0(xi) + w1 h1(xi) + … + wD hD(xi)+ εi
……
[0 0 0 … 0 0 0]
[1 0 0 … 0 0 0]
[0 1 0 … 0 0 0]
[1 1 0 … 0 0 0]
[1 1 1 … 1 1 1]
……
2D
28 = 256
230 = 1,073,741,824
21000 = 1.071509 x 10301
2100B = HUGE!!!!!!
Typically,
computationally
infeasible
- 29. Machine
Learning
Specializa0on
29
Forward stepwise algorithm
1. Pick a dictionary of features {h0(x),…,hD(x)}
- e.g., polynomials for linear regression
2. Greedy heuristic:
i. Start with empty set of features F0 = ∅
(or simple set, like just h0(x)=1 à yi= w0+εi)
ii. Fit model using current feature set Ft to get ŵ(t)
iii. Select next best feature hj*(x)
- e.g., hj(x) resulting in lowest training error
when learning with Ft + {hj(x)}
iv. Set Ft+1 ç Ft + {hj*(x)}
v. Recurse
29
©2015
Emily
Fox
&
Carlos
Guestrin
- 30. Machine
Learning
Specializa0on
30
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 31. Machine
Learning
Specializa0on
31
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 32. Machine
Learning
Specializa0on
32
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 33. Machine
Learning
Specializa0on
33
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 34. Machine
Learning
Specializa0on
34
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- 35. Machine
Learning
Specializa0on
35
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
- 36. Machine
Learning
Specializa0on
36
Visualizing greedy procedure
©2015
Emily
Fox
&
Carlos
Guestrin
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
- # bedrooms
- # bathrooms
- sq.ft. living
- sq.ft. lot
- floors
- year built
- year renovated
- waterfront
error never increases
+
solutions eventually meet
- 37. Machine
Learning
Specializa0on
37
When do we stop?
When training error is low enough?
When test error is low enough?
Use validation set or cross validation!
37
©2015
Emily
Fox
&
Carlos
Guestrin
No!
No!
- 38. Machine
Learning
Specializa0on
38
Complexity of forward stepwise
How many models were evaluated?
- 1st step, D models
- 2nd step, D-1 models (add 1 feature out of D-1 possible)
- 3rd step, D-2 models (add 1 feature out of D-2 possible)
- …
How many steps?
- Depends
- At most D steps (to full model)
©2015
Emily
Fox
&
Carlos
Guestrin
O(D2) << 2D
for large D
- 39. Machine
Learning
Specializa0on
39
Other greedy algorithms
Instead of starting from simple model
and always growing…
Backward stepwise:
Start with full model and iteratively remove
features least useful to fit
Combining forward and backward steps:
In forward algorithm, insert steps to remove
features no longer as important
Lots of other variants, too.
39
©2015
Emily
Fox
&
Carlos
Guestrin
- 41. Machine
Learning
Specializa0on
41
Ridge regression:
L2 regularized regression
Total cost =
measure of fit + λ measure of magnitude
of coefficients
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w)
||w||2=w0
2+…+wD
2
2
Encourages small weights
but not exactly 0
- 42. Machine
Learning
Specializa0on
42
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− Ridge
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – ridge
©2015
Emily
Fox
&
Carlos
Guestrin
λ
coefficientsŵj
- 43. Machine
Learning
Specializa0on
43
Efficiency:
- If size(w) = 100B, each prediction is expensive
- If ŵ sparse , computation only depends on # of non-zeros
Interpretability:
- Which features are relevant for prediction?
Recall sparsity (many ŵj=0)
gives efficiency and interpretability
©2015
Emily
Fox
&
Carlos
Guestrin
many zeros
=
X
ˆwj 6=0
ŷi = ŵj hj(xi)
- 44. Machine
Learning
Specializa0on
44
Using regularization for
feature selection
Instead of searching over a discrete set of
solutions, can we use regularization?
- Start with full model (all possible features)
- “Shrink” some coefficients exactly to 0
• i.e., knock out certain features
- Non-zero coefficients indicate “selected” features
©2015
Emily
Fox
&
Carlos
Guestrin
- 45. Machine
Learning
Specializa0on
45
Thresholding ridge coefficients?
Why don’t we just set small ridge coefficients to 0?
©2015
Emily
Fox
&
Carlos
Guestrin
0
- 46. Machine
Learning
Specializa0on
46
Thresholding ridge coefficients?
Selected features for a given threshold value
©2015
Emily
Fox
&
Carlos
Guestrin
0
- 47. Machine
Learning
Specializa0on
47
Thresholding ridge coefficients?
Let’s look at two related features…
©2015
Emily
Fox
&
Carlos
Guestrin
0
Nothing measuring bathrooms was included!
- 48. Machine
Learning
Specializa0on
48
Thresholding ridge coefficients?
If only one of the features had been included…
©2015
Emily
Fox
&
Carlos
Guestrin
0
- 49. Machine
Learning
Specializa0on
49
Thresholding ridge coefficients?
Would have included bathrooms in selected model
©2015
Emily
Fox
&
Carlos
Guestrin
0
Can regularization lead directly to sparsity?
- 50. Machine
Learning
Specializa0on
50
Try this cost instead of ridge…
Total cost =
measure of fit + λ measure of magnitude
of coefficients
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w)
||w||1=|w0|+…+|wD|
Lasso regression
(a.k.a. L1 regularized regression)
Leads to
sparse
solutions!
- 51. Machine
Learning
Specializa0on
51
Lasso regression:
L1 regularized regression
Just like ridge regression, solution is
governed by a continuous parameter λ
If λ=0:
If λ=∞:
If λ in between:
RSS(w) + λ||w||1
tuning parameter = balance of fit and sparsity
©2015
Emily
Fox
&
Carlos
Guestrin
- 52. Machine
Learning
Specializa0on
52
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− Ridge
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – ridge
©2015
Emily
Fox
&
Carlos
Guestrin
λ
coefficientsŵj
- 53. Machine
Learning
Specializa0on
53
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− LASSO
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – lasso
©2015
Emily
Fox
&
Carlos
Guestrin
λ
coefficientsŵj
- 56. Machine
Learning
Specializa0on
56
Visualizing the ridge cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
w0
w1
RSS Cost
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2
- 57. Machine
Learning
Specializa0on
57
Visualizing the ridge cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
w0
w1
L2 penalty
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2
- 58. Machine
Learning
Specializa0on
58
Visualizing the ridge cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2
- 59. Machine
Learning
Specializa0on
59
Visualizing the ridge solution
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
5215
5215
5215
5215
5215
5215
5215
w0
w1
4.75
4.75
level sets intersect
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2
- 61. Machine
Learning
Specializa0on
61
Visualizing the lasso cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
w0
w1
RSS Cost
−10 −5 0 5 10
−10
−5
0
5
10
- 62. Machine
Learning
Specializa0on
62
Visualizing the lasso cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
w0
w1
L1 penalty
−10 −5 0 5 10
−10
−5
0
5
10
- 63. Machine
Learning
Specializa0on
63
Visualizing the lasso cost in 2D
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
- 64. Machine
Learning
Specializa0on
64
Visualizing the lasso solution
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
5215
5215
5215
5215
5215
5215
5215
w0
w1
2.75
2.75
level sets intersect
−10 −5 0 5 10
−10
−5
0
5
10
- 65. Machine
Learning
Specializa0on
65
Revisit polynomial fit demo
What happens if we refit our high-order
polynomial, but now using lasso regression?
Will consider a few settings of λ …
©2015
Emily
Fox
&
Carlos
Guestrin
- 67. Machine
Learning
Specializa0on
67
©2015
Emily
Fox
&
Carlos
Guestrin
Training
Data
Feature
extraction
ML
model
Quality
metric
ML algorithm
ŵy
ŷh(x)
- 68. Machine
Learning
Specializa0on
68
How we optimized past objectives
To solve for ŵ, previously took gradient
of total cost objective and either:
1) Derived closed-form solution
2) Used in gradient descent algorithm
©2015
Emily
Fox
&
Carlos
Guestrin
- 69. Machine
Learning
Specializa0on
69
Optimizing the lasso objective
Lasso total cost:
Issues:
1) What’s the derivative of |wj|?
2) Even if we could compute derivative, no closed-form solution
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + ||w||1λ
gradients à subgradients
can use subgradient descent
- 71. Machine
Learning
Specializa0on
71
Coordinate descent
Goal: Minimize some function g
Often, hard to find minimum for all coordinates, but easy for each coordinate
Coordinate descent:
Initialize ŵ = 0 (or smartly…)
while not converged
pick a coordinate j
ŵj ß
©2015
Emily
Fox
&
Carlos
Guestrin
- 72. Machine
Learning
Specializa0on
72
Comments on coordinate descent
How do we pick next coordinate?
- At random (“random” or “stochastic” coordinate descent),
round robin, …
No stepsize to choose!
Super useful approach for many problems
- Converges to optimum in some cases
(e.g., “strongly convex”)
- Converges for lasso objective
©2015
Emily
Fox
&
Carlos
Guestrin
- 74. Machine
Learning
Specializa0on
74
p
Normalizing features
Scale training columns (not rows!)
as:
Apply same training scale factors
to test data:
©2015
Emily
Fox
&
Carlos
Guestrin
hj(xk) =
hj(xk)
hj(xi)2
NX
i=1
summing over
training points
apply to
test point
Training
features
Test
features
Normalizer:
zj
Normalizer:
zj
…
hj(xk) =
hj(xk)
p
hj(xi)2
NX
i=1
- 75. Machine
Learning
Specializa0on
Aside 3: Coordinate descent for
unregularized regression
(for normalized features)
©2015
Emily
Fox
&
Carlos
Guestrin
- 76. Machine
Learning
Specializa0on
76
Optimizing least squares objective
one coordinate at a time
Fix all coordinates w-j and take partial w.r.t. wj
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) = (yi- wjhj(xi))2
NX
i=1
DX
j=0
NX
i=1
RSS(w) = -2 hj(xi)(yi – wjhj(xi))
∂
∂wj
DX
j=0
- 77. Machine
Learning
Specializa0on
77
Optimizing least squares objective
one coordinate at a time
Set partial = 0 and solve
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) = (yi- wjhj(xi))2
NX
i=1
DX
j=0
RSS(w) = -2ρj + 2wj = 0
∂
∂wj
- 78. Machine
Learning
Specializa0on
78
Coordinate descent for
least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj = ρj
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
prediction
without feature j
residual
without feature j
- 80. Machine
Learning
Specializa0on
80
Coordinate descent for
least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj = ρj
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
prediction
without feature j
residual
without feature j
- 81. Machine
Learning
Specializa0on
81
Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 82. Machine
Learning
Specializa0on
82
Soft thresholding
©2015
Emily
Fox
&
Carlos
Guestrin
ŵj
ρj
ŵj =
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 83. Machine
Learning
Specializa0on
83
How to assess convergence?
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 84. Machine
Learning
Specializa0on
84
When to stop?
For convex problems, will
start to take smaller and
smaller steps
Measure size of steps
taken in a full loop over
all features
- stop when max step < ε
Convergence criteria
©2015
Emily
Fox
&
Carlos
Guestrin
- 85. Machine
Learning
Specializa0on
85
Other lasso solvers
©2015
Emily
Fox
&
Carlos
Guestrin
Classically: Least angle regression (LARS) [Efron et al. ‘04]
Then: Coordinate descent algorithm
[Fu ‘98, Friedman, Hastie, & Tibshirani ’08]
Now:
• Parallel CD (e.g., Shotgun, [Bradley et al. ‘11])
• Other parallel learning approaches for linear models
- Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
- Parallel independent solutions then averaging [Zhang et al. ‘12]
• Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]
- 87. Machine
Learning
Specializa0on
87
Coordinate descent for lasso
with normalized features
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 88. Machine
Learning
Specializa0on
88
Coordinate descent for lasso
with unnormalized features
Precompute:
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
zj = hj(xi)2
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 90. Machine
Learning
Specializa0on
90
Optimizing lasso objective
one coordinate at a time
Fix all coordinates w-j and take partial w.r.t. wj
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
derive without normalizing features
- 91. Machine
Learning
Specializa0on
91
Part 1: Partial of RSS term
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
NX
i=1
RSS(w) = -2 hj(xi)(yi – wjhj(xi))
∂
∂wj
DX
j=0
- 92. Machine
Learning
Specializa0on
92
Part 2: Partial of L1 penalty term
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
λ |wj| = ???∂
∂wj
|x|
x
- 93. Machine
Learning
Specializa0on
93
Subgradients of convex functions
Gradients lower bound convex functions:
Subgradients: Generalize gradients to non-differentiable points:
- Any plane that lower bounds function
©2015
Emily
Fox
&
Carlos
Guestrin
ba
unique at x if function
differentiable at x
g(x)
x
|x|
x
- 94. Machine
Learning
Specializa0on
94
Part 2: Subgradient of L1 term
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
λ∂
|wj| =wj
|wj|
wj
-λ when wj < 0
λ when wj > 0
[-λ,λ] when wj = 0
- 95. Machine
Learning
Specializa0on
95
Putting it all together…
©2015
Emily
Fox
&
Carlos
Guestrin
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
∂
[lasso cost] = 2zjwj – 2ρj +wj
-λ when wj < 0
λ when wj > 0
[-λ, λ] when wj = 0
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0=
- 96. Machine
Learning
Specializa0on
96
Optimal solution:
Set subgradient = 0
©2015
Emily
Fox
&
Carlos
Guestrin
∂
[lasso cost] = = 0wj
Case 1 (wj < 0):
Case 2 (wj = 0):
Case 3 (wj > 0):
2zjŵj – 2ρj – λ = 0
ŵj = 0
For ŵj < 0, need
For ŵj = 0, need [-2ρj-λ, -2ρj+λ] to contain 0:
2zjŵj – 2ρj + λ = 0 For ŵj > 0, need
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0
- 97. Machine
Learning
Specializa0on
97
Optimal solution:
Set subgradient = 0
©2015
Emily
Fox
&
Carlos
Guestrin
∂
[lasso cost] = = 0wj
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]ŵj =
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0
- 98. Machine
Learning
Specializa0on
98
©2015
Emily
Fox
&
Carlos
Guestrin
ŵj
ρj
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]ŵj =
Soft thresholding
- 99. Machine
Learning
Specializa0on
99
Coordinate descent for lasso
Precompute:
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015
Emily
Fox
&
Carlos
Guestrin
NX
i=1
zj = hj(xi)2
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
- 101. Machine
Learning
Specializa0on
101
If sufficient amount of data…
©2015
Emily
Fox
&
Carlos
Guestrin
Validation
set
Training set
Test
set
fit ŵλ
test performance
of ŵλ to select λ*
assess
generalization
error of ŵλ*
- 102. Machine
Learning
Specializa0on
102
For k=1,…,K
1. Estimate ŵλ
(k) on the training blocks
2. Compute error on validation block: errork(λ)
Compute average error: CV(λ) = errork(λ)
©2015
Emily
Fox
&
Carlos
Guestrin
Valid
set
ŵλ
(5)
error5(λ)
1
K
KX
k=1
K-fold cross validation
- 103. Machine
Learning
Specializa0on
103
Choosing λ via cross validation
Cross validation is choosing the λ that
provides best predictive accuracy
Tends to favor less sparse solutions, and
thus smaller λ, than optimal choice for
feature selection
c.f., “Machine Learning: A Probabilistic Perspective”,
Murphy, 2012 for further discussion
©2015
Emily
Fox
&
Carlos
Guestrin
- 105. Machine
Learning
Specializa0on
105
Debiasing lasso
Lasso shrinks coefficients
relative to LS solution
à more bias, less variance
Can reduce bias as follows:
1. Run lasso to select
features
2. Run least squares
regression with only
selected features
“Relevant” features no longer
shrunk relative to LS fit of
same reduced model
©2015
Emily
Fox
&
Carlos
Guestrin
Figure used with permission of Mario Figueiredo
(captions modified to fit course)
True coefficients (D=4096, non-zero = 160)
L1 reconstruction (non-zero = 1024, MSE = 0.0072)
Debiased (non-zero = 1024, MSE = 3.26e-005)
Least squares (non-zero = 0, MSE = 1.568)
- 106. Machine
Learning
Specializa0on
106
Issues with standard lasso objective
1. With group of highly correlated features, lasso tends to
select amongst them arbitrarily
- Often prefer to select all together
2. Often, empirically ridge has better predictive performance
than lasso, but lasso leads to sparser solution
Elastic net aims to address these issues
- hybrid between lasso and ridge regression
- uses L1 and L2 penalties
See Zou & Hastie ‘05 for further discussion
©2015
Emily
Fox
&
Carlos
Guestrin
- 108. Machine
Learning
Specializa0on
108
Impact of feature selection and lasso
Lasso has changed machine learning,
statistics, & electrical engineering
But, for feature selection in general, be careful
about interpreting selected features
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under
certain conditions
©2015
Emily
Fox
&
Carlos
Guestrin
- 109. Machine
Learning
Specializa0on
109
What you can do now…
• Perform feature selection using “all subsets” and “forward
stepwise” algorithms
• Analyze computational costs of these algorithms
• Contrast greedy and optimal algorithms
• Formulate lasso objective
• Describe what happens to estimated lasso coefficients as
tuning parameter λ is varied
• Interpret lasso coefficient path plot
• Contrast ridge and lasso regression
• Describe geometrically why L1 penalty leads to sparsity
• Estimate lasso regression parameters using an iterative
coordinate descent algorithm
• Implement K-fold cross validation to select lasso tuning
parameter λ
©2015
Emily
Fox
&
Carlos
Guestrin