1. Learning Theory Put to Work
Isabelle Guyon, isabelle@clopinet.com
2. What is the process of Data Mining / Machine Learning?
[Diagram: TRAINING DATA feeds a learning algorithm, which produces a trained machine; a query is then submitted to the trained machine, which returns an answer.]
3. For which tasks?
- Classification (binary/categorical target)
- Regression and time series prediction (continuous targets)
- Clustering (targets unknown)
- Rule discovery
4. For which applications?
[Chart: applications positioned by number of inputs (10 to 10^5) and number of training examples (10 to 10^6): Bioinformatics, Quality control, Machine vision, Customer knowledge, OCR, HWR, Market Analysis, Text Categorization, System diagnosis.]
5. Banking / Telecom / Retail
Identify:
- Prospective customers
- Dissatisfied customers
- Good customers
- Bad payers
Obtain:
- More effective advertising
- Less credit risk
- Less fraud
- Decreased churn rate
6. Biomedical / Biometrics
Medicine:
- Screening
- Diagnosis and prognosis
- Drug discovery
Security:
- Face recognition
- Signature / fingerprint / iris verification
- DNA fingerprinting
7. Computer / Internet
Computer interfaces:
- Troubleshooting wizards
- Handwriting and speech
- Brain waves
Internet:
- Hit ranking
- Spam filtering
- Text categorization
- Text translation
- Recommendation
8. From Statistics to Machine Learning… and back!
Old textbook statistics were descriptive:
- Mean, variance
- Confidence intervals
- Statistical tests
- Fit data, discover distributions (past data)
Machine learning (1960s) is predictive:
- Training / validation / test sets
- Build robust predictive models (future data)
Learning theory (1990s):
- Rigorous statistical framework for ML
- Proper monitoring of fit vs. robustness
9. Some Learning Machines
- Linear models
- Polynomial models
- Kernel methods
- Neural networks
- Decision trees
10. Conventions
[Diagram: data matrix X = {x_ij} with n columns (attributes/features) and m rows (samples/customers/patients); x_i denotes one sample (row), y = {y_j} the vector of targets, and w the parameter vector.]
11. Linear Models
- f(x) = Σ_{j=1..n} w_j x_j + b
- Linear discriminant (for classification):
  F(x) = +1 if f(x) > 0
  F(x) = -1 if f(x) ≤ 0
LINEAR = WEIGHTED SUM
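A minimal sketch of this weighted sum and the associated discriminant in Python/NumPy; the weights, bias, and input below are made-up placeholders, not values from the slides:

```python
import numpy as np

def f(x, w, b):
    """Linear model: weighted sum of the inputs plus a bias term."""
    return np.dot(w, x) + b

def F(x, w, b):
    """Linear discriminant: +1 if f(x) > 0, -1 otherwise."""
    return 1 if f(x, w, b) > 0 else -1

# Toy usage with made-up weights and input
w, b = np.array([0.5, -1.2, 0.3]), 0.1
print(F(np.array([1.0, 0.2, -0.7]), w, b))
```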
12. Non-linear models
- Linear models (artificial neurons):
  f(x) = Σ_{j=1..n} w_j x_j + b
- Models non-linear in their inputs, but linear in their parameters:
  f(x) = Σ_{j=1..N} w_j Φ_j(x) + b   (Perceptron)
  f(x) = Σ_{i=1..m} α_i k(x_i, x) + b   (Kernel method)
- Other non-linear models:
  - Neural networks / multi-layer perceptrons
  - Decision trees
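For illustration, a sketch of the kernel expansion f(x) = Σ_i α_i k(x_i, x) + b with a Gaussian (RBF) kernel; the slide does not prescribe a particular kernel, and the training points and coefficients below are placeholders:

```python
import numpy as np

def rbf_kernel(xi, x, gamma=1.0):
    """Gaussian (RBF) kernel k(x_i, x) = exp(-gamma * ||x_i - x||^2)."""
    return np.exp(-gamma * np.sum((xi - x) ** 2))

def f_kernel(x, X_train, alpha, b=0.0, gamma=1.0):
    """Kernel expansion f(x) = sum_i alpha_i * k(x_i, x) + b."""
    return sum(a * rbf_kernel(xi, x, gamma) for a, xi in zip(alpha, X_train)) + b

# Toy usage with made-up training points and coefficients
X_train = np.array([[0.0, 0.0], [1.0, 1.0]])
alpha = np.array([0.7, -0.3])
print(f_kernel(np.array([0.5, 0.5]), X_train, alpha))
```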
13. Linear Decision Boundary
[Figure: a hyperplane f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
14. Non-linear Decision Boundary
[Figure: a curved boundary f(x) = 0 in the (x1, x2) plane separating the region f(x) > 0 from the region f(x) < 0.]
15. Fit / Robustness Tradeoff
[Figure: two (x1, x2) scatter plots contrasting a decision boundary that fits the training data closely with a simpler, more robust one.]
16. Performance Assessment
Compare F(x) = sign(f(x)) to the target y, and report:
- Error rate = (fn + fp)/m
- {Hit rate, False alarm rate} or {Hit rate, Precision} or {Hit rate, Fraction selected}
- Balanced error rate (BER) = (fn/pos + fp/neg)/2 = 1 - (sensitivity + specificity)/2
- F measure = 2 precision * recall / (precision + recall)
Vary the decision threshold θ in F(x) = sign(f(x) + θ), and plot:
- ROC curve: Hit rate vs. False alarm rate
- Lift curve: Hit rate vs. Fraction selected
- Precision/recall curve: Hit rate vs. Precision

Confusion matrix (truth y in rows, predictions F(x) in columns):

             Predicted -1     Predicted +1     Total
Class -1     tn               fp               neg = tn + fp
Class +1     fn               tp               pos = fn + tp
Total        rej = tn + fn    sel = fp + tp    m = tn + fp + fn + tp

False alarm rate = fp/neg = type I error rate = 1 - specificity
Hit rate = tp/pos = 1 - type II error rate = sensitivity = recall = test power
Precision = tp/sel
Fraction selected = sel/m
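A sketch computing these figures of merit from the confusion matrix, assuming labels coded in {-1, +1} as NumPy arrays; the function name and output format are my own, and no guard is included for empty classes:

```python
import numpy as np

def performance(y_true, y_pred):
    """Confusion-matrix figures of merit for labels in {-1, +1}."""
    tp = np.sum((y_pred == 1) & (y_true == 1))
    fp = np.sum((y_pred == 1) & (y_true == -1))
    tn = np.sum((y_pred == -1) & (y_true == -1))
    fn = np.sum((y_pred == -1) & (y_true == 1))
    m, pos, neg, sel = tp + fp + tn + fn, tp + fn, tn + fp, tp + fp
    hit_rate = tp / pos                 # sensitivity = recall
    precision = tp / sel
    return {
        "error_rate": (fn + fp) / m,
        "hit_rate": hit_rate,
        "false_alarm_rate": fp / neg,   # 1 - specificity
        "BER": (fn / pos + fp / neg) / 2,
        "F_measure": 2 * precision * hit_rate / (precision + hit_rate),
        "fraction_selected": sel / m,
    }
```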
17. ROC Curve
[Figure: hit rate (sensitivity) vs. false alarm rate (1 - specificity), both from 0 to 100%; the ideal ROC curve has AUC = 1, the random ROC is the diagonal with AUC = 0.5, and the actual ROC lies in between; 0 ≤ AUC ≤ 1.]
Patients are diagnosed by putting a threshold on f(x). For a given threshold you get a point on the ROC curve.
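A sketch of building the ROC curve by sweeping the threshold θ over the scores f(x) and computing the AUC with the trapezoid rule; it assumes labels in {-1, +1} and scores as a NumPy array:

```python
import numpy as np

def roc_curve(scores, y_true):
    """Sweep a threshold over f(x) and collect (false alarm rate, hit rate) points."""
    pos, neg = np.sum(y_true == 1), np.sum(y_true == -1)
    points = [(0.0, 0.0)]
    for theta in np.sort(np.unique(scores))[::-1]:
        pred = np.where(scores >= theta, 1, -1)
        hit = np.sum((pred == 1) & (y_true == 1)) / pos
        fa = np.sum((pred == 1) & (y_true == -1)) / neg
        points.append((fa, hit))
    return points

def auc(points):
    """Area under the ROC curve, computed with the trapezoid rule."""
    area = 0.0
    for (fa0, hit0), (fa1, hit1) in zip(points[:-1], points[1:]):
        area += (fa1 - fa0) * (hit0 + hit1) / 2
    return area
```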
18. Lift Curve
[Figure: hit rate (fraction of good customers selected) vs. fraction of customers selected, both from 0 to 100%; the random lift is the diagonal, the ideal lift rises steeply to 100%, and the actual lift lies in between.]
Customers are ranked according to f(x); the top-ranking customers are selected.
Gini = 2 AUC - 1, with 0 ≤ Gini ≤ 1.
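Similarly, a sketch of the lift curve: customers are ranked by f(x) and the hit rate is accumulated over the top-ranked fraction (labels in {-1, +1} assumed; the function name is my own):

```python
import numpy as np

def lift_curve(scores, y_true):
    """Rank by f(x), select top-ranked fractions, return (fraction selected, hit rate)."""
    order = np.argsort(scores)[::-1]        # highest-scored customers first
    good = (y_true[order] == 1)
    m, pos = len(y_true), good.sum()
    frac_selected = np.arange(1, m + 1) / m
    hit_rate = np.cumsum(good) / pos        # fraction of good customers recovered
    return frac_selected, hit_rate
```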
19. What is a Risk Functional?
A function of the parameters of the learning machine, assessing how much it is expected to fail on a given task.
Examples:
- Classification:
  - Error rate: (1/m) Σ_{i=1..m} 1(F(x_i) ≠ y_i)
  - 1 - AUC (Gini index = 2 AUC - 1)
- Regression:
  - Mean square error: (1/m) Σ_{i=1..m} (f(x_i) - y_i)²
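A sketch of these two empirical risk estimates, written for any classifier F(x, w) and regressor f(x, w) passed in as Python callables (the calling convention is my own assumption):

```python
import numpy as np

def error_rate(F, X, y, w):
    """Empirical classification risk: fraction of examples with F(x_i; w) != y_i."""
    return float(np.mean([F(x, w) != yi for x, yi in zip(X, y)]))

def mean_square_error(f, X, y, w):
    """Empirical regression risk: average of (f(x_i; w) - y_i)^2."""
    return float(np.mean([(f(x, w) - yi) ** 2 for x, yi in zip(X, y)]))
```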
20. How to train?
- Define a risk functional R[f(x, w)]
- Optimize it w.r.t. w (gradient descent, mathematical programming, simulated annealing, genetic algorithms, etc.)
[Figure: R[f(x, w)] plotted over the parameter space (w), with its minimum at w*.]
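A sketch of one of the optimization options listed above, plain gradient descent, applied to the mean square error of a linear model f(x) = w·x + b; the step size and iteration count are arbitrary placeholders:

```python
import numpy as np

def train_linear(X, y, lr=0.01, n_iter=1000):
    """Gradient descent on the mean square error of f(x) = w.x + b."""
    m, n = X.shape
    w, b = np.zeros(n), 0.0
    for _ in range(n_iter):
        residual = X @ w + b - y            # f(x_i) - y_i for all examples
        w -= lr * (2 / m) * X.T @ residual  # gradient of the MSE w.r.t. w
        b -= lr * (2 / m) * residual.sum()  # gradient of the MSE w.r.t. b
    return w, b
```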
21. Theoretical Foundations
Training powerful models without overfitting:
- Structural Risk Minimization
- Regularization
- Weight decay
- Feature selection
- Data compression
22. Ockham's Razor
- Principle proposed by William of Ockham in the fourteenth century: "Pluralitas non est ponenda sine necessitate."
- Of two theories providing similarly good predictions, prefer the simplest one.
- Shave off unnecessary parameters of your models.
23. Risk Minimization
- Examples are given:
  (x_1, y_1), (x_2, y_2), … (x_m, y_m)
- Learning problem: find the best function f(x; w) minimizing a risk functional
  R[f] = ∫ L(f(x; w), y) dP(x, y)
  where L is the loss function and P(x, y) is the unknown data distribution.
24. Approximations of R[f]
- Empirical risk: R_train[f] = (1/m) Σ_{i=1..m} L(f(x_i; w), y_i)
  - 0/1 loss 1(F(x_i) ≠ y_i): R_train[f] = error rate
  - square loss (f(x_i) - y_i)²: R_train[f] = mean square error
- Guaranteed risk:
  With high probability (1 - δ): R[f] ≤ R_gua[f]
  R_gua[f] = R_train[f] + ε(C)
25. Structural Risk Minimization (Vapnik, 1974)
Nested subsets of models of increasing complexity/capacity: S_1 ⊂ S_2 ⊂ … ⊂ S_N
[Figure: nested sets S_1, S_2, S_3 of increasing complexity; a plot of the training error Tr and the guaranteed risk Ga = Tr + ε(C), a function of the model complexity/capacity C.]
26. SRM Example
- Rank with ||w||² = Σ_i w_i²
- S_k = { w | ||w||² < ω_k² },  ω_1 < ω_2 < … < ω_k
- Minimization under constraint:
  min R_train[f]  s.t.  ||w||² < ω_k²
- Lagrangian:
  R_reg[f, γ] = R_train[f] + γ ||w||²
[Figure: risk R vs. capacity for the nested subsets S_1 ⊂ S_2 ⊂ … ⊂ S_N.]
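As an illustration of this regularized criterion, a sketch of ridge regression for a linear model with square loss and no bias term; the closed-form solution and the value of γ are assumptions of the sketch (in practice γ is chosen by cross-validation, as on slide 28), not something the slide prescribes:

```python
import numpy as np

def ridge_fit(X, y, gamma=0.1):
    """Minimize (1/m)||Xw - y||^2 + gamma * ||w||^2 for a linear model f(x) = w.x."""
    m, n = X.shape
    # Setting the gradient to zero gives w = (X'X + gamma*m*I)^-1 X'y
    return np.linalg.solve(X.T @ X + gamma * m * np.eye(n), X.T @ y)
```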
27. Multiple Structures
- Shrinkage (weight decay, ridge regression, SVM):
  S_k = { w | ||w||² < ω_k },  ω_1 < ω_2 < … < ω_k
  γ_1 > γ_2 > γ_3 > … > γ_k  (γ is the ridge)
- Feature selection:
  S_k = { w | ||w||_0 < σ_k },  σ_1 < σ_2 < … < σ_k  (σ is the number of features)
- Data compression:
  κ_1 < κ_2 < … < κ_k  (κ may be the number of clusters)
28. Hyper-parameter Selection
- Learning = adjusting:
  - parameters (the w vector),
  - hyper-parameters.
- Cross-validation with K folds (see the sketch after this slide):
  For various values of the hyper-parameter:
  - Adjust w on a fraction (K-1)/K of the training examples, e.g. 9/10th.
  - Test on the 1/K remaining examples, e.g. 1/10th.
  - Rotate the folds and average the test results (CV error).
  Select the hyper-parameter value that minimizes the CV error.
  Re-compute w on all training examples using the optimal value.
[Figure: the data (X, y) is split into training data, divided into K folds, and held-out test data for a prospective / "real" validation.]
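A sketch of this K-fold procedure, assuming a fitting routine such as the ridge_fit sketch from slide 26 and mean square error as the risk; the candidate hyper-parameter values in the commented usage are placeholders:

```python
import numpy as np

def mse(X, y, w):
    """Mean square error of the linear model f(x) = w.x on (X, y)."""
    return np.mean((X @ w - y) ** 2)

def cv_error(fit, X, y, gamma, K=10):
    """K-fold cross-validation estimate of the risk for one hyper-parameter value."""
    folds = np.array_split(np.random.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.hstack([folds[j] for j in range(K) if j != k])
        w = fit(X[train], y[train], gamma)        # adjust w on (K-1)/K of the examples
        errors.append(mse(X[test], y[test], w))   # test on the remaining 1/K
    return np.mean(errors)

# Select the hyper-parameter minimizing the CV error, then re-compute w on all training
# examples (ridge_fit is the sketch from slide 26; the candidate values are placeholders):
# best = min([0.01, 0.1, 1.0], key=lambda g: cv_error(ridge_fit, X, y, g))
# w = ridge_fit(X, y, best)
```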
29. Summary
- SRM provides a theoretical framework for robust predictive modeling (overfitting avoidance), using the notions of guaranteed risk and model capacity.
- Multiple structures may be used to control the model capacity, including feature selection, data compression, and ridge regression.
30. KXEN (simplified) architecture
[Diagram: Data Preparation and Data Encoding feed a Learning Algorithm that fits a Class of Models under a Loss Criterion, mapping the data (x_k, y) to the model parameters w.]
31. KXEN: SRM put to work
[Figure: lift curves (fraction of good customers selected vs. fraction of customers selected, up to 100%) comparing the training lift, CV lift, and test lift against the random and ideal lifts.]
Customers are ranked according to f(x); the top-ranking customers are selected.
32. Want to Learn More?
- Statistical Learning Theory, V. Vapnik. Theoretical book; the reference on generalization, VC dimension, Structural Risk Minimization, and SVMs. ISBN 0471030031.
- Pattern Classification, R. Duda, P. Hart, and D. Stork. Standard pattern recognition textbook, limited to classification problems. Matlab code. http://rii.ricoh.com/~stork/DHS.html
- The Elements of Statistical Learning: Data Mining, Inference, and Prediction, T. Hastie, R. Tibshirani, J. Friedman. Standard statistics textbook covering the standard machine learning methods for classification, regression, and clustering. R code. http://www-stat-class.stanford.edu/~tibs/ElemStatLearn/
- Feature Extraction: Foundations and Applications, I. Guyon et al., Eds. Book for practitioners, with the datasets of the NIPS 2003 challenge, tutorials, best performing methods, Matlab code, and teaching material. http://clopinet.com/fextract-book
