SlideShare une entreprise Scribd logo
1  sur  109
Télécharger pour lire hors ligne
Machine	
  Learning	
  Specializa0on	
  
Lasso Regression:
Regularization for feature selection
Emily Fox & Carlos Guestrin
Machine Learning Specialization
University of Washington
1 ©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Feature selection task
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  3	
  
Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ sparse , computation only depends on # of non-zeros
Interpretability:
-  Which features are relevant for prediction?
Why might you want to perform
feature selection?
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  3
many zeros
=
X
ˆwj 6=0
ŷi = ŵj hj(xi)
Machine	
  Learning	
  Specializa0on	
  4	
  
Sparsity:
Housing application
$ ?
Lot	
  size	
  
Single	
  Family	
  
Year	
  built	
  
Last	
  sold	
  price	
  
Last	
  sale	
  price/sqI	
  
Finished	
  sqI	
  
Unfinished	
  sqI	
  
Finished	
  basement	
  sqI	
  
#	
  floors	
  
Flooring	
  types	
  
Parking	
  type	
  
Parking	
  amount	
  
Cooling	
  
Hea0ng	
  
Exterior	
  materials	
  
Roof	
  type	
  
Structure	
  style	
  
	
  
Dishwasher	
  
Garbage	
  disposal	
  
Microwave	
  
Range	
  /	
  Oven	
  
Refrigerator	
  
Washer	
  
Dryer	
  
Laundry	
  loca0on	
  
Hea0ng	
  type	
  
JeYed	
  Tub	
  
Deck	
  
Fenced	
  Yard	
  
Lawn	
  
Garden	
  
Sprinkler	
  System	
  
	
  
	
  
Lot	
  size	
  
Single	
  Family	
  
Year	
  built	
  
Last	
  sold	
  price	
  
Last	
  sale	
  price/sqI	
  
Finished	
  sqI	
  
Unfinished	
  sqI	
  
Finished	
  basement	
  sqI	
  
#	
  floors	
  
Flooring	
  types	
  
Parking	
  type	
  
Parking	
  amount	
  
Cooling	
  
Hea0ng	
  
Exterior	
  materials	
  
Roof	
  type	
  
Structure	
  style	
  
	
  
Dishwasher	
  
Garbage	
  disposal	
  
Microwave	
  
Range	
  /	
  Oven	
  
Refrigerator	
  
Washer	
  
Dryer	
  
Laundry	
  loca0on	
  
Hea0ng	
  type	
  
JeYed	
  Tub	
  
Deck	
  
Fenced	
  Yard	
  
Lawn	
  
Garden	
  
Sprinkler	
  System	
  
	
  
	
  
…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  5	
  
Sparsity:
Reading your mind
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
very sad very happy
Activity in which
brain regions
can predict
happiness?	
  
Machine	
  Learning	
  Specializa0on	
  
Option 1: All subsets
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  7	
  
Find best model of size: 0
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  8	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  9	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  10	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  11	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  12	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  13	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  14	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  15	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  16	
  
Find best model of size: 1
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  17	
  
Find best model of size: 2
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  18	
  
Note: Not necessarily nested!
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  19	
  
Note: Not necessarily nested!
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  20	
  
Find best model of size: 3
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  21	
  
Find best model of size: 4
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  22	
  
Find best model of size: 5
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  23	
  
Find best model of size: 6
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  24	
  
Find best model of size: 7
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  25	
  
Find best model of size: 8
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  26	
  
Choosing model complexity?
Option 1: Assess on validation set
Option 2: Cross validation
Option 3+: Other metrics for penalizing
model complexity like BIC…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  27	
  
Complexity of “all subsets”
How many models were evaluated?
-  each indexed by features included
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
yi = εi
yi = w0h0(xi) + εi
yi = w1 h1(xi) + εi
yi = w0h0(xi) + w1 h1(xi) + εi
yi = w0h0(xi) + w1 h1(xi) + … + wD hD(xi)+ εi
……
[0 0 0 … 0 0 0]
[1 0 0 … 0 0 0]
[0 1 0 … 0 0 0]
[1 1 0 … 0 0 0]
[1 1 1 … 1 1 1]
……
2D
28 = 256
230 = 1,073,741,824
21000 = 1.071509 x 10301
2100B = HUGE!!!!!!
Typically,
computationally
infeasible
Machine	
  Learning	
  Specializa0on	
  
Option 2: Greedy algorithms
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  29	
  
Forward stepwise algorithm
1.  Pick a dictionary of features {h0(x),…,hD(x)}
- e.g., polynomials for linear regression
2.  Greedy heuristic:
i.  Start with empty set of features F0 = ∅
(or simple set, like just h0(x)=1 à yi= w0+εi)
ii.  Fit model using current feature set Ft to get ŵ(t)
iii.  Select next best feature hj*(x)
-  e.g., hj(x) resulting in lowest training error
when learning with Ft + {hj(x)}
iv.  Set Ft+1 ç Ft + {hj*(x)}
v.  Recurse
29
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  30	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  31	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  32	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  33	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  34	
  
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
Machine	
  Learning	
  Specializa0on	
  35	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
Machine	
  Learning	
  Specializa0on	
  36	
  
Visualizing greedy procedure
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
# of features
RSS(ŵ)
0 1 2 3 4 5 6 7 8
-  # bedrooms
-  # bathrooms
-  sq.ft. living
-  sq.ft. lot
-  floors
-  year built
-  year renovated
-  waterfront
error never increases
+
solutions eventually meet
Machine	
  Learning	
  Specializa0on	
  37	
  
When do we stop?
When training error is low enough?
When test error is low enough?
Use validation set or cross validation!
37
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
No!
No!
Machine	
  Learning	
  Specializa0on	
  38	
  
Complexity of forward stepwise
How many models were evaluated?
-  1st step, D models
-  2nd step, D-1 models (add 1 feature out of D-1 possible)
-  3rd step, D-2 models (add 1 feature out of D-2 possible)
-  …
How many steps?
-  Depends
-  At most D steps (to full model)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
O(D2) << 2D
for large D
Machine	
  Learning	
  Specializa0on	
  39	
  
Other greedy algorithms
Instead of starting from simple model
and always growing…
Backward stepwise:
Start with full model and iteratively remove
features least useful to fit
Combining forward and backward steps:
In forward algorithm, insert steps to remove
features no longer as important
Lots of other variants, too.
39
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Option 3: Regularize
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  41	
  
Ridge regression:
L2 regularized regression
Total cost =
measure of fit + λ measure of magnitude
of coefficients
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w)
||w||2=w0
2+…+wD
2
2	
  
Encourages small weights
but not exactly 0
Machine	
  Learning	
  Specializa0on	
  42	
  
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− Ridge
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – ridge
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
λ
coefficientsŵj
Machine	
  Learning	
  Specializa0on	
  43	
  
Efficiency:
-  If size(w) = 100B, each prediction is expensive
-  If ŵ sparse , computation only depends on # of non-zeros
Interpretability:
-  Which features are relevant for prediction?
Recall sparsity (many ŵj=0)
gives efficiency and interpretability
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
many zeros
=
X
ˆwj 6=0
ŷi = ŵj hj(xi)
Machine	
  Learning	
  Specializa0on	
  44	
  
Using regularization for
feature selection
Instead of searching over a discrete set of
solutions, can we use regularization?
- Start with full model (all possible features)
- “Shrink” some coefficients exactly to 0
•  i.e., knock out certain features
- Non-zero coefficients indicate “selected” features
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  45	
  
Thresholding ridge coefficients?
Why don’t we just set small ridge coefficients to 0?
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
0
Machine	
  Learning	
  Specializa0on	
  46	
  
Thresholding ridge coefficients?
Selected features for a given threshold value
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
0
Machine	
  Learning	
  Specializa0on	
  47	
  
Thresholding ridge coefficients?
Let’s look at two related features…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
0
Nothing measuring bathrooms was included!
Machine	
  Learning	
  Specializa0on	
  48	
  
Thresholding ridge coefficients?
If only one of the features had been included…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
0
Machine	
  Learning	
  Specializa0on	
  49	
  
Thresholding ridge coefficients?
Would have included bathrooms in selected model
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
0
Can regularization lead directly to sparsity?
Machine	
  Learning	
  Specializa0on	
  50	
  
Try this cost instead of ridge…
Total cost =
measure of fit + λ measure of magnitude
of coefficients
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w)
||w||1=|w0|+…+|wD|
Lasso regression
(a.k.a. L1 regularized regression)
Leads to
sparse
solutions!
Machine	
  Learning	
  Specializa0on	
  51	
  
Lasso regression:
L1 regularized regression
Just like ridge regression, solution is
governed by a continuous parameter λ
If λ=0:
If λ=∞: 	
  
If λ in between:
RSS(w) + λ||w||1
tuning parameter = balance of fit and sparsity
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  52	
  
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− Ridge
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – ridge
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
λ
coefficientsŵj
Machine	
  Learning	
  Specializa0on	
  53	
  
0 50000 100000 150000 200000
−1000000100000200000 coefficient paths −− LASSO
coefficient
bedrooms
bathrooms
sqft_living
sqft_lot
floors
yr_built
yr_renovat
waterfront
Coefficient path – lasso
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
λ
coefficientsŵj
Machine	
  Learning	
  Specializa0on	
  
Geometric intuition for
sparsity of lasso solution
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Geometric intuition for
ridge regression
55 ©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  56	
  
Visualizing the ridge cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
w0
w1
RSS Cost
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2	
  
Machine	
  Learning	
  Specializa0on	
  57	
  
Visualizing the ridge cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
w0
w1
L2 penalty
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2	
  
Machine	
  Learning	
  Specializa0on	
  58	
  
Visualizing the ridge cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2	
  
Machine	
  Learning	
  Specializa0on	
  59	
  
Visualizing the ridge solution
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
5215
5215
5215
5215
5215
5215
5215
w0
w1
4.75
4.75
level sets intersect
−10 −5 0 5 10
−10
−5
0
5
10
RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0
2+w1
2)2	
  
Machine	
  Learning	
  Specializa0on	
  
Geometric intuition for lasso
60 ©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  61	
  
Visualizing the lasso cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
w0
w1
RSS Cost
−10 −5 0 5 10
−10
−5
0
5
10
Machine	
  Learning	
  Specializa0on	
  62	
  
Visualizing the lasso cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
w0
w1
L1 penalty
−10 −5 0 5 10
−10
−5
0
5
10
Machine	
  Learning	
  Specializa0on	
  63	
  
Visualizing the lasso cost in 2D
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
Machine	
  Learning	
  Specializa0on	
  64	
  
Visualizing the lasso solution
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|)
NX
i=1
5215
5215
5215
5215
5215
5215
5215
w0
w1
2.75
2.75
level sets intersect
−10 −5 0 5 10
−10
−5
0
5
10
Machine	
  Learning	
  Specializa0on	
  65	
  
Revisit polynomial fit demo
What happens if we refit our high-order
polynomial, but now using lasso regression?
Will consider a few settings of λ …
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Fitting the lasso regression model
(for given λ value)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  67	
   ©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Training
Data
Feature
extraction
ML
model
Quality
metric
ML algorithm
ŵy
ŷh(x)
Machine	
  Learning	
  Specializa0on	
  68	
  
How we optimized past objectives
To solve for ŵ, previously took gradient
of total cost objective and either:
1)  Derived closed-form solution
2)  Used in gradient descent algorithm
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  69	
  
Optimizing the lasso objective
Lasso total cost:
Issues:
1)  What’s the derivative of |wj|?
2)  Even if we could compute derivative, no closed-form solution
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + ||w||1λ
gradients à subgradients
can use subgradient descent
Machine	
  Learning	
  Specializa0on	
  
Aside 1: Coordinate descent
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  71	
  
Coordinate descent
Goal: Minimize some function g
Often, hard to find minimum for all coordinates, but easy for each coordinate
Coordinate descent:
Initialize ŵ = 0 (or smartly…)
while not converged
pick a coordinate j
ŵj ß
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  72	
  
Comments on coordinate descent
How do we pick next coordinate?
-  At random (“random” or “stochastic” coordinate descent),
round robin, …
No stepsize to choose!
Super useful approach for many problems
-  Converges to optimum in some cases
(e.g., “strongly convex”)
-  Converges for lasso objective
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Aside 2: Normalizing features
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  74	
  
p
Normalizing features
Scale training columns (not rows!)
as:
Apply same training scale factors
to test data:
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
hj(xk) =
hj(xk)
hj(xi)2
NX
i=1
summing over
training points
apply to
test point
Training
features
Test
features
Normalizer:
zj
Normalizer:
zj
…
hj(xk) =
hj(xk)
p
hj(xi)2
NX
i=1
Machine	
  Learning	
  Specializa0on	
  
Aside 3: Coordinate descent for
unregularized regression
(for normalized features)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  76	
  
Optimizing least squares objective
one coordinate at a time
Fix all coordinates w-j and take partial w.r.t. wj
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) = (yi- wjhj(xi))2
NX
i=1
DX
j=0
NX
i=1
RSS(w) = -2 hj(xi)(yi – wjhj(xi))
∂
∂wj
DX
j=0
Machine	
  Learning	
  Specializa0on	
  77	
  
Optimizing least squares objective
one coordinate at a time
Set partial = 0 and solve
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) = (yi- wjhj(xi))2
NX
i=1
DX
j=0
RSS(w) = -2ρj + 2wj = 0
∂
∂wj
Machine	
  Learning	
  Specializa0on	
  78	
  
Coordinate descent for
least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj = ρj
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
prediction
without feature j
residual
without feature j
Machine	
  Learning	
  Specializa0on	
  
Coordinate descent for lasso
(for normalized features)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  80	
  
Coordinate descent for
least squares regression
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj = ρj
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
prediction
without feature j
residual
without feature j
Machine	
  Learning	
  Specializa0on	
  81	
  
Coordinate descent for lasso
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  82	
  
Soft thresholding
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
ŵj
ρj
ŵj =
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  83	
  
How to assess convergence?
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  84	
  
When to stop?
For convex problems, will
start to take smaller and
smaller steps
Measure size of steps
taken in a full loop over
all features
-  stop when max step < ε
Convergence criteria
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  85	
  
Other lasso solvers
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Classically: Least angle regression (LARS) [Efron et al. ‘04]
Then: Coordinate descent algorithm
[Fu ‘98, Friedman, Hastie, & Tibshirani ’08]
Now:
•  Parallel CD (e.g., Shotgun, [Bradley et al. ‘11])
•  Other parallel learning approaches for linear models
-  Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11])
-  Parallel independent solutions then averaging [Zhang et al. ‘12]
•  Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]
Machine	
  Learning	
  Specializa0on	
  
Coordinate descent for lasso
(for unnormalized features)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  87	
  
Coordinate descent for lasso
with normalized features
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
ρj + λ/2 if ρj < -λ/2
ρj – λ/2 if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  88	
  
Coordinate descent for lasso
with unnormalized features
Precompute:
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
zj = hj(xi)2
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  
Deriving the lasso coordinate
descent update
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
OPTIONAL
Machine	
  Learning	
  Specializa0on	
  90	
  
Optimizing lasso objective
one coordinate at a time
Fix all coordinates w-j and take partial w.r.t. wj
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
derive without normalizing features
Machine	
  Learning	
  Specializa0on	
  91	
  
Part 1: Partial of RSS term
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
NX
i=1
RSS(w) = -2 hj(xi)(yi – wjhj(xi))
∂
∂wj
DX
j=0
Machine	
  Learning	
  Specializa0on	
  92	
  
Part 2: Partial of L1 penalty term
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
λ |wj| = ???∂
∂wj
|x|
x
Machine	
  Learning	
  Specializa0on	
  93	
  
Subgradients of convex functions
Gradients lower bound convex functions:
Subgradients: Generalize gradients to non-differentiable points:
-  Any plane that lower bounds function
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
ba
unique at x if function
differentiable at x
g(x)
x
|x|
x
Machine	
  Learning	
  Specializa0on	
  94	
  
Part 2: Subgradient of L1 term
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
λ∂	
  	
  	
  |wj| =wj
|wj|
wj
-λ when wj < 0
λ when wj > 0
[-λ,λ] when wj = 0
Machine	
  Learning	
  Specializa0on	
  95	
  
Putting it all together…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj|
NX
i=1
DX
j=0
DX
j=0
∂	
  	
  	
  [lasso cost] = 2zjwj – 2ρj +wj
-λ when wj < 0
λ when wj > 0
[-λ, λ] when wj = 0
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0=	
  
Machine	
  Learning	
  Specializa0on	
  96	
  
Optimal solution:
Set subgradient = 0
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
∂	
  	
  	
  [lasso cost] = = 0wj
Case 1 (wj < 0):
Case 2 (wj = 0):
Case 3 (wj > 0):
2zjŵj – 2ρj – λ = 0
ŵj = 0
For ŵj < 0, need
For ŵj = 0, need [-2ρj-λ, -2ρj+λ] to contain 0:
2zjŵj – 2ρj + λ = 0 For ŵj > 0, need
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0
Machine	
  Learning	
  Specializa0on	
  97	
  
Optimal solution:
Set subgradient = 0
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
∂	
  	
  	
  [lasso cost] = = 0wj
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]ŵj =
2zjwj – 2ρj – λ when wj < 0
2zjwj – 2ρj + λ when wj > 0
[-2ρj-λ, -2ρj+λ] when wj = 0
Machine	
  Learning	
  Specializa0on	
  98	
   ©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
ŵj
ρj
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]ŵj =
Soft thresholding
Machine	
  Learning	
  Specializa0on	
  99	
  
Coordinate descent for lasso
Precompute:
Initialize ŵ = 0 (or smartly…)
while not converged
for j=0,1,…,D
compute:
set: ŵj =
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
NX
i=1
zj = hj(xi)2
NX
i=1
ρj = hj(xi)(yi – ŷi(ŵ-j))
(ρj + λ/2)/zj if ρj < -λ/2
(ρj – λ/2)/zj if ρj > λ/2
0 if ρj in [-λ/2, λ/2]
Machine	
  Learning	
  Specializa0on	
  
How to choose λ
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  101	
  
If sufficient amount of data…
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Validation
set
Training set
Test
set
fit ŵλ
test performance
of ŵλ to select λ*
assess
generalization
error of ŵλ*
Machine	
  Learning	
  Specializa0on	
  102	
  
For k=1,…,K
1.  Estimate ŵλ
(k) on the training blocks
2.  Compute error on validation block: errork(λ)
Compute average error: CV(λ) = errork(λ)
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Valid
set
ŵλ
(5)	
   error5(λ)
1
K
KX
k=1
K-fold cross validation
Machine	
  Learning	
  Specializa0on	
  103	
  
Choosing λ via cross validation
Cross validation is choosing the λ that
provides best predictive accuracy
Tends to favor less sparse solutions, and
thus smaller λ, than optimal choice for
feature selection
c.f., “Machine Learning: A Probabilistic Perspective”,
Murphy, 2012 for further discussion
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Practical concerns with lasso
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  105	
  
Debiasing lasso
Lasso shrinks coefficients
relative to LS solution
à  more bias, less variance
Can reduce bias as follows:
1.  Run lasso to select
features
2.  Run least squares
regression with only
selected features
“Relevant” features no longer
shrunk relative to LS fit of
same reduced model
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Figure used with permission of Mario Figueiredo
(captions modified to fit course)
True coefficients (D=4096, non-zero = 160)
L1 reconstruction (non-zero = 1024, MSE = 0.0072)
Debiased (non-zero = 1024, MSE = 3.26e-005)
Least squares (non-zero = 0, MSE = 1.568)
Machine	
  Learning	
  Specializa0on	
  106	
  
Issues with standard lasso objective
1.  With group of highly correlated features, lasso tends to
select amongst them arbitrarily
-  Often prefer to select all together
2.  Often, empirically ridge has better predictive performance
than lasso, but lasso leads to sparser solution
Elastic net aims to address these issues
-  hybrid between lasso and ridge regression
-  uses L1 and L2 penalties
See Zou & Hastie ‘05 for further discussion
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  
Summary for feature selection
and lasso regression
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  108	
  
Impact of feature selection and lasso
Lasso has changed machine learning,
statistics, & electrical engineering
But, for feature selection in general, be careful
about interpreting selected features
- selection only considers features included
- sensitive to correlations between features
- result depends on algorithm used
- there are theoretical guarantees for lasso under
certain conditions
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  
Machine	
  Learning	
  Specializa0on	
  109	
  
What you can do now…
•  Perform feature selection using “all subsets” and “forward
stepwise” algorithms
•  Analyze computational costs of these algorithms
•  Contrast greedy and optimal algorithms
•  Formulate lasso objective
•  Describe what happens to estimated lasso coefficients as
tuning parameter λ is varied
•  Interpret lasso coefficient path plot
•  Contrast ridge and lasso regression
•  Describe geometrically why L1 penalty leads to sparsity
•  Estimate lasso regression parameters using an iterative
coordinate descent algorithm
•  Implement K-fold cross validation to select lasso tuning
parameter λ
©2015	
  Emily	
  Fox	
  &	
  Carlos	
  Guestrin	
  

Contenu connexe

Tendances

Extract information from pie
Extract information from pieExtract information from pie
Extract information from pieNadeem Uddin
 
7th PreAlg - L74--May4
7th PreAlg - L74--May47th PreAlg - L74--May4
7th PreAlg - L74--May4jdurst65
 
Applications of calculus in commerce and economics
Applications of calculus in commerce and economicsApplications of calculus in commerce and economics
Applications of calculus in commerce and economicssumanmathews
 
Applications of calculus in commerce and economics ii
Applications of calculus in commerce and economics ii Applications of calculus in commerce and economics ii
Applications of calculus in commerce and economics ii sumanmathews
 
Day 4 evaluating with add and subtract
Day 4 evaluating with add and subtractDay 4 evaluating with add and subtract
Day 4 evaluating with add and subtractErik Tjersland
 
Tutorials--Solving One-Step Equations
Tutorials--Solving One-Step EquationsTutorials--Solving One-Step Equations
Tutorials--Solving One-Step EquationsMedia4math
 
Percentages for competetive exams
Percentages for  competetive examsPercentages for  competetive exams
Percentages for competetive examsavdheshtripathi2
 
Decimal arithematic operation
Decimal arithematic operationDecimal arithematic operation
Decimal arithematic operationPadmapriyaG
 

Tendances (11)

Algebra
AlgebraAlgebra
Algebra
 
Extract information from pie
Extract information from pieExtract information from pie
Extract information from pie
 
7th PreAlg - L74--May4
7th PreAlg - L74--May47th PreAlg - L74--May4
7th PreAlg - L74--May4
 
Gm
GmGm
Gm
 
Applications of calculus in commerce and economics
Applications of calculus in commerce and economicsApplications of calculus in commerce and economics
Applications of calculus in commerce and economics
 
Applications of calculus in commerce and economics ii
Applications of calculus in commerce and economics ii Applications of calculus in commerce and economics ii
Applications of calculus in commerce and economics ii
 
Day 4 evaluating with add and subtract
Day 4 evaluating with add and subtractDay 4 evaluating with add and subtract
Day 4 evaluating with add and subtract
 
TALLER N°2
TALLER N°2TALLER N°2
TALLER N°2
 
Tutorials--Solving One-Step Equations
Tutorials--Solving One-Step EquationsTutorials--Solving One-Step Equations
Tutorials--Solving One-Step Equations
 
Percentages for competetive exams
Percentages for  competetive examsPercentages for  competetive exams
Percentages for competetive exams
 
Decimal arithematic operation
Decimal arithematic operationDecimal arithematic operation
Decimal arithematic operation
 

En vedette

Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic netVivian S. Zhang
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsDerek Kane
 
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized ModelPredicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Modelweekendsunny
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsGabriel Peyré
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationGabriel Peyré
 
Digital Signal Processing - Practical Techniques, Tips and Tricks Course Sampler
Digital Signal Processing - Practical Techniques, Tips and Tricks Course SamplerDigital Signal Processing - Practical Techniques, Tips and Tricks Course Sampler
Digital Signal Processing - Practical Techniques, Tips and Tricks Course SamplerJim Jenkins
 
Study of volatility in foreign exchange market
Study of volatility in foreign exchange marketStudy of volatility in foreign exchange market
Study of volatility in foreign exchange marketSunita Jindal
 
Robotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlRobotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlCody Ray
 
Holistic slides lightning_testing_bof
Holistic slides lightning_testing_bofHolistic slides lightning_testing_bof
Holistic slides lightning_testing_bofTerry Peppers
 
Open Risk Analysis Software - Data And Methodologies
Open Risk Analysis Software - Data And MethodologiesOpen Risk Analysis Software - Data And Methodologies
Open Risk Analysis Software - Data And MethodologiesChristakis Mina, PhD, ACIArb
 
structural equation modeling
structural equation modelingstructural equation modeling
structural equation modelingQiang Hao
 
A study on functioning of foreign exchange market
A study on functioning of foreign exchange marketA study on functioning of foreign exchange market
A study on functioning of foreign exchange marketVivek Mahajan
 
Seminar on Robust Regression Methods
Seminar on Robust Regression MethodsSeminar on Robust Regression Methods
Seminar on Robust Regression MethodsSumon Sdb
 
4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedzeFreedom Gumedze
 
A_Study_on_the_Medieval_Kerala_School_of_Mathematics
A_Study_on_the_Medieval_Kerala_School_of_MathematicsA_Study_on_the_Medieval_Kerala_School_of_Mathematics
A_Study_on_the_Medieval_Kerala_School_of_MathematicsSumon Sdb
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsNBER
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression MethodsSumon Sdb
 

En vedette (20)

Ridge regression, lasso and elastic net
Ridge regression, lasso and elastic netRidge regression, lasso and elastic net
Ridge regression, lasso and elastic net
 
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic NetsData Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
Data Science - Part XII - Ridge Regression, LASSO, and Elastic Nets
 
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized ModelPredicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
Predicting Short Term Movements of Stock Prices: A Two-Stage L1-Penalized Model
 
Signal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse ProblemsSignal Processing Course : Sparse Regularization of Inverse Problems
Signal Processing Course : Sparse Regularization of Inverse Problems
 
Signal Processing Course : Convex Optimization
Signal Processing Course : Convex OptimizationSignal Processing Course : Convex Optimization
Signal Processing Course : Convex Optimization
 
Digital Signal Processing - Practical Techniques, Tips and Tricks Course Sampler
Digital Signal Processing - Practical Techniques, Tips and Tricks Course SamplerDigital Signal Processing - Practical Techniques, Tips and Tricks Course Sampler
Digital Signal Processing - Practical Techniques, Tips and Tricks Course Sampler
 
Study of volatility in foreign exchange market
Study of volatility in foreign exchange marketStudy of volatility in foreign exchange market
Study of volatility in foreign exchange market
 
Robotics: Modelling, Planning and Control
Robotics: Modelling, Planning and ControlRobotics: Modelling, Planning and Control
Robotics: Modelling, Planning and Control
 
Holistic slides lightning_testing_bof
Holistic slides lightning_testing_bofHolistic slides lightning_testing_bof
Holistic slides lightning_testing_bof
 
Sample data analysis_elmaddah
Sample data analysis_elmaddahSample data analysis_elmaddah
Sample data analysis_elmaddah
 
Open Risk Analysis Software - Data And Methodologies
Open Risk Analysis Software - Data And MethodologiesOpen Risk Analysis Software - Data And Methodologies
Open Risk Analysis Software - Data And Methodologies
 
structural equation modeling
structural equation modelingstructural equation modeling
structural equation modeling
 
A study on functioning of foreign exchange market
A study on functioning of foreign exchange marketA study on functioning of foreign exchange market
A study on functioning of foreign exchange market
 
Lasso
LassoLasso
Lasso
 
Seminar on Robust Regression Methods
Seminar on Robust Regression MethodsSeminar on Robust Regression Methods
Seminar on Robust Regression Methods
 
4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze4thchannel conference poster_freedom_gumedze
4thchannel conference poster_freedom_gumedze
 
A_Study_on_the_Medieval_Kerala_School_of_Mathematics
A_Study_on_the_Medieval_Kerala_School_of_MathematicsA_Study_on_the_Medieval_Kerala_School_of_Mathematics
A_Study_on_the_Medieval_Kerala_School_of_Mathematics
 
Introduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and AlgorithmsIntroduction to Supervised ML Concepts and Algorithms
Introduction to Supervised ML Concepts and Algorithms
 
Seminar- Robust Regression Methods
Seminar- Robust Regression MethodsSeminar- Robust Regression Methods
Seminar- Robust Regression Methods
 
Seminarppt
SeminarpptSeminarppt
Seminarppt
 

Similaire à C2.5

Mixed Integer Linear Programming Application in Real Estate Profit Optimization
Mixed Integer Linear Programming Application in Real Estate Profit Optimization	Mixed Integer Linear Programming Application in Real Estate Profit Optimization
Mixed Integer Linear Programming Application in Real Estate Profit Optimization Weiling Kang
 
Business Canvas and SWOT Analysis For mo.docx
             Business Canvas and SWOT Analysis   For mo.docx             Business Canvas and SWOT Analysis   For mo.docx
Business Canvas and SWOT Analysis For mo.docxhallettfaustina
 
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docx
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docxTomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docx
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docxedwardmarivel
 
Secreti _GB Project9_7_16-1
Secreti _GB Project9_7_16-1Secreti _GB Project9_7_16-1
Secreti _GB Project9_7_16-1Robert Secreti
 

Similaire à C2.5 (7)

Mixed Integer Linear Programming Application in Real Estate Profit Optimization
Mixed Integer Linear Programming Application in Real Estate Profit Optimization	Mixed Integer Linear Programming Application in Real Estate Profit Optimization
Mixed Integer Linear Programming Application in Real Estate Profit Optimization
 
Architecture portfolio2
Architecture portfolio2Architecture portfolio2
Architecture portfolio2
 
Architecture portfolio2
Architecture portfolio2Architecture portfolio2
Architecture portfolio2
 
Architecture portfolio2
Architecture portfolio2Architecture portfolio2
Architecture portfolio2
 
Business Canvas and SWOT Analysis For mo.docx
             Business Canvas and SWOT Analysis   For mo.docx             Business Canvas and SWOT Analysis   For mo.docx
Business Canvas and SWOT Analysis For mo.docx
 
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docx
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docxTomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docx
Tomica - Week 1.pdfGBA 334 Application Assignment 4 Instru.docx
 
Secreti _GB Project9_7_16-1
Secreti _GB Project9_7_16-1Secreti _GB Project9_7_16-1
Secreti _GB Project9_7_16-1
 

Plus de Daniel LIAO (17)

C4.5
C4.5C4.5
C4.5
 
C4.4
C4.4C4.4
C4.4
 
C4.3.1
C4.3.1C4.3.1
C4.3.1
 
Lime
LimeLime
Lime
 
C4.1.2
C4.1.2C4.1.2
C4.1.2
 
C4.1.1
C4.1.1C4.1.1
C4.1.1
 
C3.7.1
C3.7.1C3.7.1
C3.7.1
 
C3.6.1
C3.6.1C3.6.1
C3.6.1
 
C3.5.1
C3.5.1C3.5.1
C3.5.1
 
C3.4.2
C3.4.2C3.4.2
C3.4.2
 
C3.4.1
C3.4.1C3.4.1
C3.4.1
 
C3.3.1
C3.3.1C3.3.1
C3.3.1
 
C3.1.2
C3.1.2C3.1.2
C3.1.2
 
C3.2.2
C3.2.2C3.2.2
C3.2.2
 
C3.2.1
C3.2.1C3.2.1
C3.2.1
 
C3.1.logistic intro
C3.1.logistic introC3.1.logistic intro
C3.1.logistic intro
 
C3.1 intro
C3.1 introC3.1 intro
C3.1 intro
 

Dernier

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxAmanpreet Kaur
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Pooja Bhuva
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxDr. Sarita Anand
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfAdmir Softic
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfagholdier
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024Elizabeth Walsh
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...Poonam Aher Patil
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfSherif Taha
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfPoh-Sun Goh
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxDr. Ravikiran H M Gowda
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentationcamerronhm
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxRamakrishna Reddy Bijjam
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxPooja Bhuva
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.pptRamjanShidvankar
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxheathfieldcps1
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsTechSoup
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxAreebaZafar22
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the ClassroomPooky Knightsmith
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Association for Project Management
 

Dernier (20)

SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptxSKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
SKILL OF INTRODUCING THE LESSON MICRO SKILLS.pptx
 
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
Sensory_Experience_and_Emotional_Resonance_in_Gabriel_Okaras_The_Piano_and_Th...
 
Google Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptxGoogle Gemini An AI Revolution in Education.pptx
Google Gemini An AI Revolution in Education.pptx
 
Key note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdfKey note speaker Neum_Admir Softic_ENG.pdf
Key note speaker Neum_Admir Softic_ENG.pdf
 
Holdier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdfHoldier Curriculum Vitae (April 2024).pdf
Holdier Curriculum Vitae (April 2024).pdf
 
Spatium Project Simulation student brief
Spatium Project Simulation student briefSpatium Project Simulation student brief
Spatium Project Simulation student brief
 
FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024FSB Advising Checklist - Orientation 2024
FSB Advising Checklist - Orientation 2024
 
General Principles of Intellectual Property: Concepts of Intellectual Proper...
General Principles of Intellectual Property: Concepts of Intellectual  Proper...General Principles of Intellectual Property: Concepts of Intellectual  Proper...
General Principles of Intellectual Property: Concepts of Intellectual Proper...
 
Food safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdfFood safety_Challenges food safety laboratories_.pdf
Food safety_Challenges food safety laboratories_.pdf
 
Micro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdfMicro-Scholarship, What it is, How can it help me.pdf
Micro-Scholarship, What it is, How can it help me.pdf
 
REMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptxREMIFENTANIL: An Ultra short acting opioid.pptx
REMIFENTANIL: An Ultra short acting opioid.pptx
 
SOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning PresentationSOC 101 Demonstration of Learning Presentation
SOC 101 Demonstration of Learning Presentation
 
Python Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docxPython Notes for mca i year students osmania university.docx
Python Notes for mca i year students osmania university.docx
 
Interdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptxInterdisciplinary_Insights_Data_Collection_Methods.pptx
Interdisciplinary_Insights_Data_Collection_Methods.pptx
 
Application orientated numerical on hev.ppt
Application orientated numerical on hev.pptApplication orientated numerical on hev.ppt
Application orientated numerical on hev.ppt
 
The basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptxThe basics of sentences session 3pptx.pptx
The basics of sentences session 3pptx.pptx
 
Introduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The BasicsIntroduction to Nonprofit Accounting: The Basics
Introduction to Nonprofit Accounting: The Basics
 
ICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptxICT Role in 21st Century Education & its Challenges.pptx
ICT Role in 21st Century Education & its Challenges.pptx
 
Fostering Friendships - Enhancing Social Bonds in the Classroom
Fostering Friendships - Enhancing Social Bonds  in the ClassroomFostering Friendships - Enhancing Social Bonds  in the Classroom
Fostering Friendships - Enhancing Social Bonds in the Classroom
 
Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...Making communications land - Are they received and understood as intended? we...
Making communications land - Are they received and understood as intended? we...
 

C2.5

  • 1. Machine  Learning  Specializa0on   Lasso Regression: Regularization for feature selection Emily Fox & Carlos Guestrin Machine Learning Specialization University of Washington 1 ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 2. Machine  Learning  Specializa0on   Feature selection task ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 3. Machine  Learning  Specializa0on  3   Efficiency: -  If size(w) = 100B, each prediction is expensive -  If ŵ sparse , computation only depends on # of non-zeros Interpretability: -  Which features are relevant for prediction? Why might you want to perform feature selection? ©2015  Emily  Fox  &  Carlos  Guestrin  3 many zeros = X ˆwj 6=0 ŷi = ŵj hj(xi)
  • 4. Machine  Learning  Specializa0on  4   Sparsity: Housing application $ ? Lot  size   Single  Family   Year  built   Last  sold  price   Last  sale  price/sqI   Finished  sqI   Unfinished  sqI   Finished  basement  sqI   #  floors   Flooring  types   Parking  type   Parking  amount   Cooling   Hea0ng   Exterior  materials   Roof  type   Structure  style     Dishwasher   Garbage  disposal   Microwave   Range  /  Oven   Refrigerator   Washer   Dryer   Laundry  loca0on   Hea0ng  type   JeYed  Tub   Deck   Fenced  Yard   Lawn   Garden   Sprinkler  System       Lot  size   Single  Family   Year  built   Last  sold  price   Last  sale  price/sqI   Finished  sqI   Unfinished  sqI   Finished  basement  sqI   #  floors   Flooring  types   Parking  type   Parking  amount   Cooling   Hea0ng   Exterior  materials   Roof  type   Structure  style     Dishwasher   Garbage  disposal   Microwave   Range  /  Oven   Refrigerator   Washer   Dryer   Laundry  loca0on   Hea0ng  type   JeYed  Tub   Deck   Fenced  Yard   Lawn   Garden   Sprinkler  System       … ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 5. Machine  Learning  Specializa0on  5   Sparsity: Reading your mind ©2015  Emily  Fox  &  Carlos  Guestrin   very sad very happy Activity in which brain regions can predict happiness?  
  • 6. Machine  Learning  Specializa0on   Option 1: All subsets ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 7. Machine  Learning  Specializa0on  7   Find best model of size: 0 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 8. Machine  Learning  Specializa0on  8   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 9. Machine  Learning  Specializa0on  9   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 10. Machine  Learning  Specializa0on  10   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 11. Machine  Learning  Specializa0on  11   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 12. Machine  Learning  Specializa0on  12   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 13. Machine  Learning  Specializa0on  13   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 14. Machine  Learning  Specializa0on  14   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 15. Machine  Learning  Specializa0on  15   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 16. Machine  Learning  Specializa0on  16   Find best model of size: 1 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 17. Machine  Learning  Specializa0on  17   Find best model of size: 2 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 18. Machine  Learning  Specializa0on  18   Note: Not necessarily nested! ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 19. Machine  Learning  Specializa0on  19   Note: Not necessarily nested! ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 20. Machine  Learning  Specializa0on  20   Find best model of size: 3 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 21. Machine  Learning  Specializa0on  21   Find best model of size: 4 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 22. Machine  Learning  Specializa0on  22   Find best model of size: 5 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 23. Machine  Learning  Specializa0on  23   Find best model of size: 6 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 24. Machine  Learning  Specializa0on  24   Find best model of size: 7 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 25. Machine  Learning  Specializa0on  25   Find best model of size: 8 ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 26. Machine  Learning  Specializa0on  26   Choosing model complexity? Option 1: Assess on validation set Option 2: Cross validation Option 3+: Other metrics for penalizing model complexity like BIC… ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 27. Machine  Learning  Specializa0on  27   Complexity of “all subsets” How many models were evaluated? -  each indexed by features included ©2015  Emily  Fox  &  Carlos  Guestrin   yi = εi yi = w0h0(xi) + εi yi = w1 h1(xi) + εi yi = w0h0(xi) + w1 h1(xi) + εi yi = w0h0(xi) + w1 h1(xi) + … + wD hD(xi)+ εi …… [0 0 0 … 0 0 0] [1 0 0 … 0 0 0] [0 1 0 … 0 0 0] [1 1 0 … 0 0 0] [1 1 1 … 1 1 1] …… 2D 28 = 256 230 = 1,073,741,824 21000 = 1.071509 x 10301 2100B = HUGE!!!!!! Typically, computationally infeasible
  • 28. Machine  Learning  Specializa0on   Option 2: Greedy algorithms ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 29. Machine  Learning  Specializa0on  29   Forward stepwise algorithm 1.  Pick a dictionary of features {h0(x),…,hD(x)} - e.g., polynomials for linear regression 2.  Greedy heuristic: i.  Start with empty set of features F0 = ∅ (or simple set, like just h0(x)=1 à yi= w0+εi) ii.  Fit model using current feature set Ft to get ŵ(t) iii.  Select next best feature hj*(x) -  e.g., hj(x) resulting in lowest training error when learning with Ft + {hj(x)} iv.  Set Ft+1 ç Ft + {hj*(x)} v.  Recurse 29 ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 30. Machine  Learning  Specializa0on  30   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 31. Machine  Learning  Specializa0on  31   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 32. Machine  Learning  Specializa0on  32   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 33. Machine  Learning  Specializa0on  33   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 34. Machine  Learning  Specializa0on  34   -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8
  • 35. Machine  Learning  Specializa0on  35   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront
  • 36. Machine  Learning  Specializa0on  36   Visualizing greedy procedure ©2015  Emily  Fox  &  Carlos  Guestrin   # of features RSS(ŵ) 0 1 2 3 4 5 6 7 8 -  # bedrooms -  # bathrooms -  sq.ft. living -  sq.ft. lot -  floors -  year built -  year renovated -  waterfront error never increases + solutions eventually meet
  • 37. Machine  Learning  Specializa0on  37   When do we stop? When training error is low enough? When test error is low enough? Use validation set or cross validation! 37 ©2015  Emily  Fox  &  Carlos  Guestrin   No! No!
  • 38. Machine  Learning  Specializa0on  38   Complexity of forward stepwise How many models were evaluated? -  1st step, D models -  2nd step, D-1 models (add 1 feature out of D-1 possible) -  3rd step, D-2 models (add 1 feature out of D-2 possible) -  … How many steps? -  Depends -  At most D steps (to full model) ©2015  Emily  Fox  &  Carlos  Guestrin   O(D2) << 2D for large D
  • 39. Machine  Learning  Specializa0on  39   Other greedy algorithms Instead of starting from simple model and always growing… Backward stepwise: Start with full model and iteratively remove features least useful to fit Combining forward and backward steps: In forward algorithm, insert steps to remove features no longer as important Lots of other variants, too. 39 ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 40. Machine  Learning  Specializa0on   Option 3: Regularize ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 41. Machine  Learning  Specializa0on  41   Ridge regression: L2 regularized regression Total cost = measure of fit + λ measure of magnitude of coefficients ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) ||w||2=w0 2+…+wD 2 2   Encourages small weights but not exactly 0
  • 42. Machine  Learning  Specializa0on  42   0 50000 100000 150000 200000 −1000000100000200000 coefficient paths −− Ridge coefficient bedrooms bathrooms sqft_living sqft_lot floors yr_built yr_renovat waterfront Coefficient path – ridge ©2015  Emily  Fox  &  Carlos  Guestrin   λ coefficientsŵj
  • 43. Machine  Learning  Specializa0on  43   Efficiency: -  If size(w) = 100B, each prediction is expensive -  If ŵ sparse , computation only depends on # of non-zeros Interpretability: -  Which features are relevant for prediction? Recall sparsity (many ŵj=0) gives efficiency and interpretability ©2015  Emily  Fox  &  Carlos  Guestrin   many zeros = X ˆwj 6=0 ŷi = ŵj hj(xi)
  • 44. Machine  Learning  Specializa0on  44   Using regularization for feature selection Instead of searching over a discrete set of solutions, can we use regularization? - Start with full model (all possible features) - “Shrink” some coefficients exactly to 0 •  i.e., knock out certain features - Non-zero coefficients indicate “selected” features ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 45. Machine  Learning  Specializa0on  45   Thresholding ridge coefficients? Why don’t we just set small ridge coefficients to 0? ©2015  Emily  Fox  &  Carlos  Guestrin   0
  • 46. Machine  Learning  Specializa0on  46   Thresholding ridge coefficients? Selected features for a given threshold value ©2015  Emily  Fox  &  Carlos  Guestrin   0
  • 47. Machine  Learning  Specializa0on  47   Thresholding ridge coefficients? Let’s look at two related features… ©2015  Emily  Fox  &  Carlos  Guestrin   0 Nothing measuring bathrooms was included!
  • 48. Machine  Learning  Specializa0on  48   Thresholding ridge coefficients? If only one of the features had been included… ©2015  Emily  Fox  &  Carlos  Guestrin   0
  • 49. Machine  Learning  Specializa0on  49   Thresholding ridge coefficients? Would have included bathrooms in selected model ©2015  Emily  Fox  &  Carlos  Guestrin   0 Can regularization lead directly to sparsity?
  • 50. Machine  Learning  Specializa0on  50   Try this cost instead of ridge… Total cost = measure of fit + λ measure of magnitude of coefficients ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) ||w||1=|w0|+…+|wD| Lasso regression (a.k.a. L1 regularized regression) Leads to sparse solutions!
  • 51. Machine  Learning  Specializa0on  51   Lasso regression: L1 regularized regression Just like ridge regression, solution is governed by a continuous parameter λ If λ=0: If λ=∞:   If λ in between: RSS(w) + λ||w||1 tuning parameter = balance of fit and sparsity ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 52. Machine  Learning  Specializa0on  52   0 50000 100000 150000 200000 −1000000100000200000 coefficient paths −− Ridge coefficient bedrooms bathrooms sqft_living sqft_lot floors yr_built yr_renovat waterfront Coefficient path – ridge ©2015  Emily  Fox  &  Carlos  Guestrin   λ coefficientsŵj
  • 53. Machine  Learning  Specializa0on  53   0 50000 100000 150000 200000 −1000000100000200000 coefficient paths −− LASSO coefficient bedrooms bathrooms sqft_living sqft_lot floors yr_built yr_renovat waterfront Coefficient path – lasso ©2015  Emily  Fox  &  Carlos  Guestrin   λ coefficientsŵj
  • 54. Machine  Learning  Specializa0on   Geometric intuition for sparsity of lasso solution ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 55. Machine  Learning  Specializa0on   Geometric intuition for ridge regression 55 ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 56. Machine  Learning  Specializa0on  56   Visualizing the ridge cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 w0 w1 RSS Cost −10 −5 0 5 10 −10 −5 0 5 10 RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0 2+w1 2)2  
  • 57. Machine  Learning  Specializa0on  57   Visualizing the ridge cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 w0 w1 L2 penalty −10 −5 0 5 10 −10 −5 0 5 10 RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0 2+w1 2)2  
  • 58. Machine  Learning  Specializa0on  58   Visualizing the ridge cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0 2+w1 2)2  
  • 59. Machine  Learning  Specializa0on  59   Visualizing the ridge solution ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 5215 5215 5215 5215 5215 5215 5215 w0 w1 4.75 4.75 level sets intersect −10 −5 0 5 10 −10 −5 0 5 10 RSS(w) + λ||w||2 = (yi-w0h0(xi)-w1h1(xi))2 + λ (w0 2+w1 2)2  
  • 60. Machine  Learning  Specializa0on   Geometric intuition for lasso 60 ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 61. Machine  Learning  Specializa0on  61   Visualizing the lasso cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|) NX i=1 w0 w1 RSS Cost −10 −5 0 5 10 −10 −5 0 5 10
  • 62. Machine  Learning  Specializa0on  62   Visualizing the lasso cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|) NX i=1 w0 w1 L1 penalty −10 −5 0 5 10 −10 −5 0 5 10
  • 63. Machine  Learning  Specializa0on  63   Visualizing the lasso cost in 2D ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|) NX i=1
  • 64. Machine  Learning  Specializa0on  64   Visualizing the lasso solution ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi-w0h0(xi)-w1h1(xi))2 + λ (|w0|+|w1|) NX i=1 5215 5215 5215 5215 5215 5215 5215 w0 w1 2.75 2.75 level sets intersect −10 −5 0 5 10 −10 −5 0 5 10
  • 65. Machine  Learning  Specializa0on  65   Revisit polynomial fit demo What happens if we refit our high-order polynomial, but now using lasso regression? Will consider a few settings of λ … ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 66. Machine  Learning  Specializa0on   Fitting the lasso regression model (for given λ value) ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 67. Machine  Learning  Specializa0on  67   ©2015  Emily  Fox  &  Carlos  Guestrin   Training Data Feature extraction ML model Quality metric ML algorithm ŵy ŷh(x)
  • 68. Machine  Learning  Specializa0on  68   How we optimized past objectives To solve for ŵ, previously took gradient of total cost objective and either: 1)  Derived closed-form solution 2)  Used in gradient descent algorithm ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 69. Machine  Learning  Specializa0on  69   Optimizing the lasso objective Lasso total cost: Issues: 1)  What’s the derivative of |wj|? 2)  Even if we could compute derivative, no closed-form solution ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + ||w||1λ gradients à subgradients can use subgradient descent
  • 70. Machine  Learning  Specializa0on   Aside 1: Coordinate descent ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 71. Machine  Learning  Specializa0on  71   Coordinate descent Goal: Minimize some function g Often, hard to find minimum for all coordinates, but easy for each coordinate Coordinate descent: Initialize ŵ = 0 (or smartly…) while not converged pick a coordinate j ŵj ß ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 72. Machine  Learning  Specializa0on  72   Comments on coordinate descent How do we pick next coordinate? -  At random (“random” or “stochastic” coordinate descent), round robin, … No stepsize to choose! Super useful approach for many problems -  Converges to optimum in some cases (e.g., “strongly convex”) -  Converges for lasso objective ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 73. Machine  Learning  Specializa0on   Aside 2: Normalizing features ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 74. Machine  Learning  Specializa0on  74   p Normalizing features Scale training columns (not rows!) as: Apply same training scale factors to test data: ©2015  Emily  Fox  &  Carlos  Guestrin   hj(xk) = hj(xk) hj(xi)2 NX i=1 summing over training points apply to test point Training features Test features Normalizer: zj Normalizer: zj … hj(xk) = hj(xk) p hj(xi)2 NX i=1
  • 75. Machine  Learning  Specializa0on   Aside 3: Coordinate descent for unregularized regression (for normalized features) ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 76. Machine  Learning  Specializa0on  76   Optimizing least squares objective one coordinate at a time Fix all coordinates w-j and take partial w.r.t. wj ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) = (yi- wjhj(xi))2 NX i=1 DX j=0 NX i=1 RSS(w) = -2 hj(xi)(yi – wjhj(xi)) ∂ ∂wj DX j=0
  • 77. Machine  Learning  Specializa0on  77   Optimizing least squares objective one coordinate at a time Set partial = 0 and solve ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) = (yi- wjhj(xi))2 NX i=1 DX j=0 RSS(w) = -2ρj + 2wj = 0 ∂ ∂wj
  • 78. Machine  Learning  Specializa0on  78   Coordinate descent for least squares regression Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ρj ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) prediction without feature j residual without feature j
  • 79. Machine  Learning  Specializa0on   Coordinate descent for lasso (for normalized features) ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 80. Machine  Learning  Specializa0on  80   Coordinate descent for least squares regression Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ρj ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) prediction without feature j residual without feature j
  • 81. Machine  Learning  Specializa0on  81   Coordinate descent for lasso Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) ρj + λ/2 if ρj < -λ/2 ρj – λ/2 if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 82. Machine  Learning  Specializa0on  82   Soft thresholding ©2015  Emily  Fox  &  Carlos  Guestrin   ŵj ρj ŵj = ρj + λ/2 if ρj < -λ/2 ρj – λ/2 if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 83. Machine  Learning  Specializa0on  83   How to assess convergence? Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) ρj + λ/2 if ρj < -λ/2 ρj – λ/2 if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 84. Machine  Learning  Specializa0on  84   When to stop? For convex problems, will start to take smaller and smaller steps Measure size of steps taken in a full loop over all features -  stop when max step < ε Convergence criteria ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 85. Machine  Learning  Specializa0on  85   Other lasso solvers ©2015  Emily  Fox  &  Carlos  Guestrin   Classically: Least angle regression (LARS) [Efron et al. ‘04] Then: Coordinate descent algorithm [Fu ‘98, Friedman, Hastie, & Tibshirani ’08] Now: •  Parallel CD (e.g., Shotgun, [Bradley et al. ‘11]) •  Other parallel learning approaches for linear models -  Parallel stochastic gradient descent (SGD) (e.g., Hogwild! [Niu et al. ’11]) -  Parallel independent solutions then averaging [Zhang et al. ‘12] •  Alternating directions method of multipliers (ADMM) [Boyd et al. ’11]
  • 86. Machine  Learning  Specializa0on   Coordinate descent for lasso (for unnormalized features) ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 87. Machine  Learning  Specializa0on  87   Coordinate descent for lasso with normalized features Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) ρj + λ/2 if ρj < -λ/2 ρj – λ/2 if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 88. Machine  Learning  Specializa0on  88   Coordinate descent for lasso with unnormalized features Precompute: Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 zj = hj(xi)2 NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) (ρj + λ/2)/zj if ρj < -λ/2 (ρj – λ/2)/zj if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 89. Machine  Learning  Specializa0on   Deriving the lasso coordinate descent update ©2015  Emily  Fox  &  Carlos  Guestrin   OPTIONAL
  • 90. Machine  Learning  Specializa0on  90   Optimizing lasso objective one coordinate at a time Fix all coordinates w-j and take partial w.r.t. wj ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj| NX i=1 DX j=0 DX j=0 derive without normalizing features
  • 91. Machine  Learning  Specializa0on  91   Part 1: Partial of RSS term ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj| NX i=1 DX j=0 DX j=0 NX i=1 RSS(w) = -2 hj(xi)(yi – wjhj(xi)) ∂ ∂wj DX j=0
  • 92. Machine  Learning  Specializa0on  92   Part 2: Partial of L1 penalty term ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj| NX i=1 DX j=0 DX j=0 λ |wj| = ???∂ ∂wj |x| x
  • 93. Machine  Learning  Specializa0on  93   Subgradients of convex functions Gradients lower bound convex functions: Subgradients: Generalize gradients to non-differentiable points: -  Any plane that lower bounds function ©2015  Emily  Fox  &  Carlos  Guestrin   ba unique at x if function differentiable at x g(x) x |x| x
  • 94. Machine  Learning  Specializa0on  94   Part 2: Subgradient of L1 term ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj| NX i=1 DX j=0 DX j=0 λ∂      |wj| =wj |wj| wj -λ when wj < 0 λ when wj > 0 [-λ,λ] when wj = 0
  • 95. Machine  Learning  Specializa0on  95   Putting it all together… ©2015  Emily  Fox  &  Carlos  Guestrin   RSS(w) + λ||w||1 = (yi- wjhj(xi))2 + λ |wj| NX i=1 DX j=0 DX j=0 ∂      [lasso cost] = 2zjwj – 2ρj +wj -λ when wj < 0 λ when wj > 0 [-λ, λ] when wj = 0 2zjwj – 2ρj – λ when wj < 0 2zjwj – 2ρj + λ when wj > 0 [-2ρj-λ, -2ρj+λ] when wj = 0=  
  • 96. Machine  Learning  Specializa0on  96   Optimal solution: Set subgradient = 0 ©2015  Emily  Fox  &  Carlos  Guestrin   ∂      [lasso cost] = = 0wj Case 1 (wj < 0): Case 2 (wj = 0): Case 3 (wj > 0): 2zjŵj – 2ρj – λ = 0 ŵj = 0 For ŵj < 0, need For ŵj = 0, need [-2ρj-λ, -2ρj+λ] to contain 0: 2zjŵj – 2ρj + λ = 0 For ŵj > 0, need 2zjwj – 2ρj – λ when wj < 0 2zjwj – 2ρj + λ when wj > 0 [-2ρj-λ, -2ρj+λ] when wj = 0
  • 97. Machine  Learning  Specializa0on  97   Optimal solution: Set subgradient = 0 ©2015  Emily  Fox  &  Carlos  Guestrin   ∂      [lasso cost] = = 0wj (ρj + λ/2)/zj if ρj < -λ/2 (ρj – λ/2)/zj if ρj > λ/2 0 if ρj in [-λ/2, λ/2]ŵj = 2zjwj – 2ρj – λ when wj < 0 2zjwj – 2ρj + λ when wj > 0 [-2ρj-λ, -2ρj+λ] when wj = 0
  • 98. Machine  Learning  Specializa0on  98   ©2015  Emily  Fox  &  Carlos  Guestrin   ŵj ρj (ρj + λ/2)/zj if ρj < -λ/2 (ρj – λ/2)/zj if ρj > λ/2 0 if ρj in [-λ/2, λ/2]ŵj = Soft thresholding
  • 99. Machine  Learning  Specializa0on  99   Coordinate descent for lasso Precompute: Initialize ŵ = 0 (or smartly…) while not converged for j=0,1,…,D compute: set: ŵj = ©2015  Emily  Fox  &  Carlos  Guestrin   NX i=1 zj = hj(xi)2 NX i=1 ρj = hj(xi)(yi – ŷi(ŵ-j)) (ρj + λ/2)/zj if ρj < -λ/2 (ρj – λ/2)/zj if ρj > λ/2 0 if ρj in [-λ/2, λ/2]
  • 100. Machine  Learning  Specializa0on   How to choose λ ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 101. Machine  Learning  Specializa0on  101   If sufficient amount of data… ©2015  Emily  Fox  &  Carlos  Guestrin   Validation set Training set Test set fit ŵλ test performance of ŵλ to select λ* assess generalization error of ŵλ*
  • 102. Machine  Learning  Specializa0on  102   For k=1,…,K 1.  Estimate ŵλ (k) on the training blocks 2.  Compute error on validation block: errork(λ) Compute average error: CV(λ) = errork(λ) ©2015  Emily  Fox  &  Carlos  Guestrin   Valid set ŵλ (5)   error5(λ) 1 K KX k=1 K-fold cross validation
  • 103. Machine  Learning  Specializa0on  103   Choosing λ via cross validation Cross validation is choosing the λ that provides best predictive accuracy Tends to favor less sparse solutions, and thus smaller λ, than optimal choice for feature selection c.f., “Machine Learning: A Probabilistic Perspective”, Murphy, 2012 for further discussion ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 104. Machine  Learning  Specializa0on   Practical concerns with lasso ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 105. Machine  Learning  Specializa0on  105   Debiasing lasso Lasso shrinks coefficients relative to LS solution à  more bias, less variance Can reduce bias as follows: 1.  Run lasso to select features 2.  Run least squares regression with only selected features “Relevant” features no longer shrunk relative to LS fit of same reduced model ©2015  Emily  Fox  &  Carlos  Guestrin   Figure used with permission of Mario Figueiredo (captions modified to fit course) True coefficients (D=4096, non-zero = 160) L1 reconstruction (non-zero = 1024, MSE = 0.0072) Debiased (non-zero = 1024, MSE = 3.26e-005) Least squares (non-zero = 0, MSE = 1.568)
  • 106. Machine  Learning  Specializa0on  106   Issues with standard lasso objective 1.  With group of highly correlated features, lasso tends to select amongst them arbitrarily -  Often prefer to select all together 2.  Often, empirically ridge has better predictive performance than lasso, but lasso leads to sparser solution Elastic net aims to address these issues -  hybrid between lasso and ridge regression -  uses L1 and L2 penalties See Zou & Hastie ‘05 for further discussion ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 107. Machine  Learning  Specializa0on   Summary for feature selection and lasso regression ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 108. Machine  Learning  Specializa0on  108   Impact of feature selection and lasso Lasso has changed machine learning, statistics, & electrical engineering But, for feature selection in general, be careful about interpreting selected features - selection only considers features included - sensitive to correlations between features - result depends on algorithm used - there are theoretical guarantees for lasso under certain conditions ©2015  Emily  Fox  &  Carlos  Guestrin  
  • 109. Machine  Learning  Specializa0on  109   What you can do now… •  Perform feature selection using “all subsets” and “forward stepwise” algorithms •  Analyze computational costs of these algorithms •  Contrast greedy and optimal algorithms •  Formulate lasso objective •  Describe what happens to estimated lasso coefficients as tuning parameter λ is varied •  Interpret lasso coefficient path plot •  Contrast ridge and lasso regression •  Describe geometrically why L1 penalty leads to sparsity •  Estimate lasso regression parameters using an iterative coordinate descent algorithm •  Implement K-fold cross validation to select lasso tuning parameter λ ©2015  Emily  Fox  &  Carlos  Guestrin