
# Regression using Apache SystemML by Alexandre V Evfimievski


This deck presents the regression algorithms supported in Apache SystemML: linear regression (least squares, direct solve), conjugate gradient, and generalized linear models.

Published in: Education


1. Regression in SystemML, Alexandre Evfimievski
2. Linear Regression
    • INPUT: Records $(x_1, y_1), (x_2, y_2), \dots, (x_n, y_n)$
        – Each $x_i$ is $m$-dimensional: $x_{i1}, x_{i2}, \dots, x_{im}$
        – Each $y_i$ is 1-dimensional
    • Want to approximate $y_i$ as a linear combination of the entries of $x_i$
        – $y_i \approx \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im}$
        – Case $m = 1$: $y_i \approx \beta_1 x_{i1}$ (note: $x = 0$ maps to $y = 0$)
    • Intercept: a "free parameter" for the default value of $y_i$
        – $y_i \approx \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im} + \beta_{m+1}$
        – Case $m = 1$: $y_i \approx \beta_1 x_{i1} + \beta_2$
    • Matrix notation: $Y \approx X\beta$, or $Y \approx (X \mid 1)\,\beta$ with intercept
        – $X$ is $n \times m$, $Y$ is $n \times 1$, $\beta$ is $m \times 1$ or $(m+1) \times 1$
3. Linear Regression: Least Squares
    • How to aggregate the errors $y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im})$?
        – What's worse: many small errors, or a few big errors?
    • Sum of squares: $\sum_{i \le n} \bigl(y_i - (\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im})\bigr)^2 \to \min$
        – A few big errors are much worse! We square them!
    • Matrix notation: $(Y - X\beta)^T (Y - X\beta) \to \min$
    • Good news: easy to solve and find the β's
    • Bad news: too sensitive to outliers!
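4. Linear Regression: Direct Solve
    • $(Y - X\beta)^T (Y - X\beta) \to \min$
    • $Y^T Y - Y^T (X\beta) - (X\beta)^T Y + (X\beta)^T (X\beta) \to \min$
    • Drop the constant $Y^T Y$, merge the two equal cross terms, and divide by 2: $\tfrac{1}{2}\, \beta^T (X^T X)\, \beta - \beta^T (X^T Y) \to \min$
    • Take the gradient and set it to 0: $(X^T X)\, \beta - X^T Y = 0$
    • Linear equation: $(X^T X)\, \beta = X^T Y$; solution: $\beta = (X^T X)^{-1} (X^T Y)$

            A = t(X) %*% X;
            b = t(X) %*% y;
            . . . . . .
            beta_unscaled = solve (A, b);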
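The direct solve can be sketched outside DML as follows. This is a minimal pure-Python illustration, not the SystemML implementation: the helper names `transpose`, `solve`, and `linreg_direct_solve` are made up for this example, and the data is a toy set where $y = 2x + 1$ exactly (the column of ones plays the role of the intercept).

```python
def transpose(M):
    return [list(col) for col in zip(*M)]

def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]  # augmented matrix (A | b)
    for c in range(n):
        piv = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[piv] = M[piv], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def linreg_direct_solve(X, y):
    Xt = transpose(X)
    A = [[sum(u * v for u, v in zip(ci, cj)) for cj in Xt] for ci in Xt]  # A = t(X) %*% X
    b = [sum(u * v for u, v in zip(ci, y)) for ci in Xt]                  # b = t(X) %*% y
    return solve(A, b)

# Toy data: first column is the feature, second column of ones is the intercept.
X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta = linreg_direct_solve(X, y)  # slope and intercept
```

Solving the linear system is generally preferable to forming $(X^T X)^{-1}$ explicitly, which matches the use of `solve (A, b)` in the slide.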
5. Computation of $X^T X$
    • Input $(n \times m)$-matrix $X$ is often huge and sparse
        – Rows X[i, ] make up the $n$ records, often $n \gg 10^6$
        – Columns X[, j] are the features
    • Matrix $X^T X$ is $(m \times m)$ and dense
        – Cells: $(X^T X)[j_1, j_2] = \sum_{i \le n} X[i, j_1] \cdot X[i, j_2]$
        – Part of the covariance between features $j_1$ and $j_2$ across all records
        – $m$ could be small or large
    • If $m \le 1000$, $X^T X$ is small and "direct solve" is efficient…
        – … as long as $X^T X$ is computed the right way!
        – … and as long as $X^T X$ is invertible (no linearly dependent features)
6. Computation of $X^T X$
    • Naïve computation:
        a) Read $X$ into memory
        b) Copy it and rearrange cells into the transpose
        c) Multiply two huge matrices, $X^T$ and $X$
    • There is a better way: $X^T X = \sum_{i \le n} X[i, ]^T\, X[i, ]$ (outer product)
        – For all $i = 1, \dots, n$ in parallel:
            a) Read one row X[i, ]
            b) Compute the $(m \times m)$-matrix $M_i[j_1, j_2] = X[i, j_1] \cdot X[i, j_2]$
            c) Aggregate: $M = M + M_i$
    • Extends to $(X^T X)\, v$ and $X^T \mathrm{diag}(w)\, X$, used in other scripts:
        – $(X^T X)\, v = \sum_{i \le n} \bigl(\sum_{j \le m} X[i, j]\, v[j]\bigr) \cdot X[i, ]^T$
        – $X^T \mathrm{diag}(w)\, X = \sum_{i \le n} w_i \cdot X[i, ]^T\, X[i, ]$
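The row-wise aggregation can be illustrated with a small sketch. It assumes, purely for illustration, that each sparse row is stored as a `{column: value}` dictionary; the function names are hypothetical and the loop is sequential, whereas the slide describes processing the rows in parallel.

```python
def gram_by_rows(rows, m):
    """Accumulate X^T X as a sum of per-row outer products, touching only nonzeros."""
    M = [[0.0] * m for _ in range(m)]
    for x in rows:                      # x is one sparse row {column: value}
        for j1, v1 in x.items():
            for j2, v2 in x.items():
                M[j1][j2] += v1 * v2    # M_i[j1, j2] = X[i, j1] * X[i, j2]
    return M

def gram_times_vector(rows, v):
    """Compute (X^T X) v without ever forming X^T X."""
    out = [0.0] * len(v)
    for x in rows:
        dot = sum(val * v[j] for j, val in x.items())  # X[i, ] . v
        for j, val in x.items():
            out[j] += dot * val                        # add (X[i, ] . v) * X[i, ]^T
    return out

# Dense equivalent: X = [[1, 0, 2], [0, 3, 0], [1, 1, 0]]
rows = [{0: 1.0, 2: 2.0}, {1: 3.0}, {0: 1.0, 1: 1.0}]
G = gram_by_rows(rows, 3)
Gv = gram_times_vector(rows, [1.0, 1.0, 1.0])
```

The second function is the form that matters for conjugate gradient: it needs memory proportional to $m$ rather than $m^2$.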
7. Conjugate Gradient
    • What if $X^T X$ is too large, $m \gg 1000$?
        – Dense $X^T X$ may take far more memory than sparse $X$
    • Full $X^T X$ is not needed to solve $(X^T X)\, \beta = X^T Y$
        – Use an iterative method
        – Only evaluate $(X^T X)\, v$ for certain vectors $v$
    • Ex.: Gradient Descent for $f(\beta) = \tfrac{1}{2}\, \beta^T (X^T X)\, \beta - \beta^T (X^T Y)$
        – Start with any $\beta = \beta_0$
        – Take the gradient: $r = \nabla f(\beta) = (X^T X)\, \beta - X^T Y$ (also the residual)
        – Find the number $a$ that minimizes $f(\beta + a \cdot r)$: $a = -(r^T r) / (r^T X^T X\, r)$
        – Update: $\beta_{new} \leftarrow \beta + a \cdot r$
    • But the gradient is too local
        – And "forgetful"
8. Conjugate Gradient
    • PROBLEM: Gradient takes a very similar direction many times
    • Enforce orthogonality to prior directions?
        – Take the gradient: $r = (X^T X)\, \beta - X^T Y$
        – Subtract prior directions: $p^{(k)} = r - \lambda_1 p^{(1)} - \dots - \lambda_{k-1} p^{(k-1)}$
            • Pick $\lambda_i$ to ensure $(p^{(k)} \cdot p^{(i)}) = 0$ ???
        – Find the number $a^{(k)}$ that minimizes $f(\beta + a^{(k)} \cdot p^{(k)})$, etc.
    • STILL, PROBLEMS:
        – Value $a^{(k)}$ does NOT minimize $f(a^{(1)} p^{(1)} + \dots + a^{(k)} p^{(k)} + \dots + a^{(m)} p^{(m)})$
        – Keep all prior directions $p^{(1)}, p^{(2)}, \dots, p^{(k)}$? That's a lot!
    • SOLUTION: Enforce conjugacy
        – Conjugate vectors: $u^T (X^T X)\, v = 0$, instead of $u^T v = 0$
            • Matrix $X^T X$ acts as the "metric" in distorted space
        – This does minimize $f(a^{(1)} p^{(1)} + \dots + a^{(k)} p^{(k)} + \dots + a^{(m)} p^{(m)})$
            • And we only need $p^{(k-1)}$ and $r^{(k)}$ to compute $p^{(k)}$
9. Conjugate Gradient
    • Algorithm, step by step

            i = 0; beta = matrix (0, ...);                # Initially: beta = 0
            r = - t(X) %*% y;                             # Residual & gradient: r = (X^T X) beta - (X^T Y)
            p = - r;                                      # Direction for beta: negative gradient
            norm_r2 = sum (r ^ 2);                        # Norm of residual error = r^T r
            norm_r2_target = norm_r2 * tolerance ^ 2;     # Desired norm of residual error
            while (i < mi & norm_r2 > norm_r2_target) {   # WE HAVE: p is the next direction for beta
                q = t(X) %*% (X %*% p) + lambda * p;      # q = (X^T X) p, plus lambda * p for regularization
                a = norm_r2 / sum (p * q);                # a = r^T r / (p^T (X^T X) p) minimizes f(beta + a * p)
                beta = beta + a * p;                      # Update: beta_new <- beta + a * p
                r = r + a * q;                            # r_new <- (X^T X)(beta + a * p) - (X^T Y) = r + a * (X^T X) p
                old_norm_r2 = norm_r2;
                norm_r2 = sum (r ^ 2);                    # Update the norm of residual error = r^T r
                p = -r + (norm_r2 / old_norm_r2) * p;     # Update direction: (1) take negative gradient;
                                                          # (2) enforce conjugacy with previous direction
                i = i + 1;                                # Conjugacy to all older directions is automatic!
            }
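For readers who want to run the loop outside SystemML, here is a line-by-line port to plain Python. It is a sketch under assumptions: `linreg_cg` and `xtx_plus_lambda` are illustrative names, the stopping parameters are arbitrary defaults, and the toy data is chosen so the exact answer is slope 2 and intercept 1.

```python
def xtx_plus_lambda(X, p, lam):
    """Return (X^T X + lam * I) p using two passes over the rows of X."""
    Xp = [sum(xij * pj for xij, pj in zip(row, p)) for row in X]
    q = [lam * pj for pj in p]
    for row, s in zip(X, Xp):
        for j, xij in enumerate(row):
            q[j] += xij * s
    return q

def linreg_cg(X, y, lam=0.0, tol=1e-9, max_iter=100):
    m = len(X[0])
    beta = [0.0] * m
    r = [-sum(row[j] * yi for row, yi in zip(X, y)) for j in range(m)]  # r = -X^T y at beta = 0
    p = [-rj for rj in r]
    norm_r2 = sum(rj * rj for rj in r)
    target = norm_r2 * tol ** 2
    i = 0
    while i < max_iter and norm_r2 > target:
        q = xtx_plus_lambda(X, p, lam)
        a = norm_r2 / sum(pj * qj for pj, qj in zip(p, q))
        beta = [bj + a * pj for bj, pj in zip(beta, p)]
        r = [rj + a * qj for rj, qj in zip(r, q)]
        old_norm_r2 = norm_r2
        norm_r2 = sum(rj * rj for rj in r)
        p = [-rj + (norm_r2 / old_norm_r2) * pj for rj, pj in zip(r, p)]
        i += 1
    return beta, i

X = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0], [3.0, 1.0]]
y = [1.0, 3.0, 5.0, 7.0]
beta_cg, iterations = linreg_cg(X, y)
```

In exact arithmetic conjugate gradient needs at most $m$ iterations; with floating point the count can differ slightly, so the sketch returns it rather than assuming it.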
10. Degeneracy and Regularization
    • PROBLEM: What if $X$ has linearly dependent columns?
        – Cause: recoding categorical features, adding composite features
        – Then $X^T X$ is not a "metric": there exists $p$ with $\lVert p \rVert > 0$ such that $p^T (X^T X)\, p = 0$
        – In the CG step $a = r^T r / (p^T (X^T X)\, p)$: Division By Zero!
    • In fact, Least Squares then has infinitely many solutions
        – Most of them have HUGE β-values
    • Regularization: penalize β with larger values
        – L2-Regularization: $(Y - X\beta)^T (Y - X\beta) + \lambda \cdot \beta^T \beta \to \min$
        – Replace $X^T X$ with $X^T X + \lambda I$
        – Pick $\lambda \ll \mathrm{diag}(X^T X)$, refine by cross-validation
        – Do NOT regularize the intercept
    • CG: q = t(X) %*% (X %*% p) + lambda * p;
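A tiny numeric illustration of the degeneracy, using made-up data in which the second column is exactly twice the first. The helper names `gram` and `det2` are hypothetical and only cover the 2 × 2 case.

```python
def gram(X, lam=0.0):
    """Return X^T X + lam * I for a dense list-of-rows matrix X."""
    m = len(X[0])
    A = [[sum(row[i] * row[j] for row in X) for j in range(m)] for i in range(m)]
    for j in range(m):
        A[j][j] += lam
    return A

def det2(A):
    return A[0][0] * A[1][1] - A[0][1] * A[1][0]

X = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]  # column 2 = 2 * column 1

p = [2.0, -1.0]                            # nonzero direction with X p = 0
Xp = [sum(xij * pj for xij, pj in zip(row, p)) for row in X]
curvature = sum(v * v for v in Xp)         # p^T (X^T X) p, the CG denominator

det_plain = det2(gram(X))                  # singular: direct solve has no unique answer
det_ridge = det2(gram(X, lam=1e-3))        # X^T X + lam * I is invertible
```

Along the direction `p` the unregularized objective is flat, which is exactly where the CG step would divide by zero; with $\lambda > 0$ the denominator becomes $\lambda \cdot p^T p > 0$.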
11. Shifting and Scaling X
    • PROBLEM: Features have vastly different ranges:
        – Examples: [0, 1]; [2010, 2015]; [\$0.01, \$1 Billion]
    • Does each $\beta_i$ in $Y \approx X\beta$ have a different size & accuracy?
        – Is the regularization term $\lambda \cdot \beta^T \beta$ also range-dependent?
    • SOLUTION: Scale & shift features to mean = 0, variance = 1
        – Needs intercept: $Y \approx (X \mid 1)\,\beta$
        – Equivalently: (Xnew | 1) = (X | 1) %*% SST, where SST is the "Shift-Scale Transform"
    • BUT: Sparse X becomes dense Xnew …
    • SOLUTION: (Xnew | 1) %*% M = (X | 1) %*% (SST %*% M)
        – Extends to $X^T X$ and other X-products
        – Further optimization: SST has a special shape
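One way to picture the Shift-Scale Transform is as an $(m+1) \times (m+1)$ matrix with the scale factors on the diagonal and the shifts in the last row. The sketch below builds such a matrix under that assumption (the slides do not spell out the layout of SST), uses the population variance, and assumes no column is constant. The names `shift_scale_transform` and `times` are illustrative.

```python
import math

def shift_scale_transform(X):
    """Build SST so that (X | 1) @ SST has columns with mean 0 and variance 1."""
    n, m = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(m)]
    stds = [math.sqrt(sum((row[j] - means[j]) ** 2 for row in X) / n) for j in range(m)]
    S = [[0.0] * (m + 1) for _ in range(m + 1)]
    for j in range(m):
        S[j][j] = 1.0 / stds[j]          # scale
        S[m][j] = -means[j] / stds[j]    # shift, applied through the column of ones
    S[m][m] = 1.0                        # keep the intercept column unchanged
    return S

def times(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

X = [[2010.0, 0.0], [2012.0, 1.0], [2014.0, 0.5]]
X_ext = [row + [1.0] for row in X]       # (X | 1)
SST = shift_scale_transform(X)
X_new_ext = times(X_ext, SST)            # dense (Xnew | 1), formed here only for checking

M = [[1.0], [2.0], [3.0]]
lhs = times(X_new_ext, M)                # (Xnew | 1) %*% M
rhs = times(X_ext, times(SST, M))        # (X | 1) %*% (SST %*% M), keeps X sparse
```

The right-hand form never materializes the dense Xnew, which is the point of the second SOLUTION bullet.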
12. Shifting and Scaling X
    – Linear Regression Direct Solve code snippet example:

            A = t(X) %*% X;
            b = t(X) %*% y;
            if (intercept_status == 2) {
                A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
                A = diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
                b = diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
            }
            A = A + diag (lambda);
            beta_unscaled = solve (A, b);
            if (intercept_status == 2) {
                beta = scale_X * beta_unscaled;
                beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
            } else {
                beta = beta_unscaled;
            }
13. Regression in Statistics
    • Model: $Y = X\beta^* + \varepsilon$ where $\varepsilon$ is a random vector
        – There exists a "true" $\beta^*$
        – Each $\varepsilon_i$ is Gaussian with mean 0 and variance $\sigma^2$, so each $y_i$ is Gaussian with mean $\mu_i = X_i \beta^*$ and variance $\sigma^2$
    • Likelihood maximization to estimate $\beta^*$
        – Likelihood: $\ell(Y \mid X, \beta, \sigma) = \prod_{i \le n} C(\sigma) \cdot \exp\bigl(-(y_i - X_i \beta)^2 / 2\sigma^2\bigr)$
        – $\log \ell(Y \mid X, \beta, \sigma) = n \cdot c(\sigma) - \sum_{i \le n} (y_i - X_i \beta)^2 / 2\sigma^2$
        – Maximum likelihood over β = Least Squares
    • Why do we need the statistical view?
        – Confidence intervals for parameters
        – Goodness of fit tests
        – Generalizations: replace Gaussian with another distribution
14. Maximum Likelihood Estimator
    • In each $(x_i, y_i)$ let $y_i$ have distribution $\ell(y_i \mid x_i, \beta, \varphi)$
        – Records are mutually independent for $i = 1, \dots, n$
    • Estimator for β is a function $f(X, Y)$
        – $Y$ is random → $f(X, Y)$ is random
        – Unbiased estimator: for all β, mean $E\, f(X, Y) = \beta$
    • Maximum likelihood estimator
        – $\mathrm{MLE}(X, Y) = \arg\max_\beta \prod_{i \le n} \ell(y_i \mid x_i, \beta, \varphi)$
        – Asymptotically unbiased: $E\, \mathrm{MLE}(X, Y) \to \beta$ as $n \to \infty$
    • Cramér-Rao Bound
        – For unbiased estimators, $\mathrm{Var}\, f(X, Y) \ge FI(X, \beta, \varphi)^{-1}$
        – Fisher information: $FI(X, \beta, \varphi) = -E_Y\, \mathrm{Hessian}_\beta \log \ell(Y \mid X, \beta, \varphi)$
        – For MLE: $\mathrm{Var}(\mathrm{MLE}(X, Y)) \to FI(X, \beta, \varphi)^{-1}$ as $n \to \infty$
15. Variance of M.L.E.
    • The Cramér-Rao Bound is a simple way to estimate the variance of predicted parameters (for large n):
        1. Maximize $\log \ell(Y \mid X, \beta, \varphi)$ to estimate β
        2. Compute the Hessian (2nd derivatives) of $\log \ell(Y \mid X, \beta, \varphi)$
        3. Compute the "expected" Hessian: $FI = -E_Y\, \mathrm{Hessian}$
        4. Invert $FI$ as a matrix: get $FI^{-1}$
        5. Use $FI^{-1}$ as the approximate covariance matrix for the estimated β
    • For linear regression:
        – $\log \ell(Y \mid X, \beta, \sigma) = n \cdot c(\sigma) - \sum_{i \le n} (y_i - X_i \beta)^2 / 2\sigma^2$
        – $\mathrm{Hessian} = -(1/\sigma^2) \cdot X^T X$; $FI = (1/\sigma^2) \cdot X^T X$
        – $\mathrm{Cov}\, \beta \approx \sigma^2 \cdot (X^T X)^{-1}$; $\mathrm{Var}\, \beta_j \approx \sigma^2 \cdot \mathrm{diag}\bigl((X^T X)^{-1}\bigr)_j$
16. Variance of Y given X
    • MLE for the variance of Y $= \tfrac{1}{n} \sum_{i \le n} (y_i - y_{avg})^2$
        – To make it unbiased, replace $1/n$ with $1/(n - 1)$
    • Variance of $\varepsilon$ in $Y = X\beta^* + \varepsilon$ is the residual variance
        – Estimator for $\mathrm{Var}(\varepsilon) = \tfrac{1}{n - m - 1} \sum_{i \le n} (y_i - X_i \beta)^2$
    • A good regression must have $\mathrm{Var}(\varepsilon) \ll \mathrm{Var}(Y)$
        – "Explained" variance $= \mathrm{Var}(Y) - \mathrm{Var}(\varepsilon)$
    • R-squared: estimate $1 - \mathrm{Var}(\varepsilon) / \mathrm{Var}(Y)$ to test the fit:
        – $R^2_{plain} = 1 - \dfrac{\sum_{i \le n} (y_i - X_i \beta)^2}{\sum_{i \le n} (y_i - y_{avg})^2}$
        – $R^2_{adj.} = 1 - \dfrac{\sum_{i \le n} (y_i - X_i \beta)^2}{\sum_{i \le n} (y_i - y_{avg})^2} \cdot \dfrac{n - 1}{n - m - 1}$
    • Pearson residual: $r_i = (y_i - X_i \beta) / \mathrm{Var}(\varepsilon)^{1/2}$
        – Should be approximately Gaussian with mean 0 and variance 1
        – Can be used in another fitness test (more on tests later)
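The two R-squared formulas translate directly into code. In this sketch `r_squared` is a hypothetical helper, `m` counts the features excluding the intercept (matching the $n - m - 1$ denominator above), and the predictions are made-up numbers.

```python
def r_squared(y, y_pred, m):
    """Return (plain R^2, adjusted R^2) for predictions y_pred with m features plus intercept."""
    n = len(y)
    y_avg = sum(y) / n
    rss = sum((yi - pi) ** 2 for yi, pi in zip(y, y_pred))   # residual sum of squares
    tss = sum((yi - y_avg) ** 2 for yi in y)                 # total sum of squares
    plain = 1.0 - rss / tss
    adjusted = 1.0 - (rss / tss) * (n - 1) / (n - m - 1)
    return plain, adjusted

y = [1.0, 3.0, 5.0, 7.0, 9.0]
y_pred = [1.2, 2.8, 5.0, 7.2, 8.8]
r2_plain, r2_adj = r_squared(y, y_pred, m=1)
```

The adjusted value is never larger than the plain one, because the correction factor $(n - 1) / (n - m - 1)$ is at least 1.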
17. LinReg Scripts: Inputs

        # INPUT PARAMETERS:
        # ---------------------------------------------------------------------------------------
        # NAME  TYPE    DEFAULT   MEANING
        # ---------------------------------------------------------------------------------------
        # X     String  ---       Location (on HDFS) to read the matrix X of feature vectors
        # Y     String  ---       Location (on HDFS) to read the 1-column matrix Y of response values
        # B     String  ---       Location to store estimated regression parameters (the betas)
        # O     String  " "       Location to write the printed statistics; by default is standard output
        # Log   String  " "       Location to write per-iteration variables for log/debugging purposes
        # icpt  Int     0         Intercept presence, shifting and rescaling the columns of X:
        #                         0 = no intercept, no shifting, no rescaling;
        #                         1 = add intercept, but neither shift nor rescale X;
        #                         2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
        # reg   Double  0.000001  Regularization constant (lambda) for L2-regularization; set to nonzero
        #                         for highly dependent/sparse/numerous features
        # tol   Double  0.000001  Tolerance (epsilon); conjugate gradient procedure terminates early if
        #                         L2 norm of the beta-residual is less than tolerance * its initial norm
        # maxi  Int     0         Maximum number of conjugate gradient iterations, 0 = no maximum
        # fmt   String  "text"    Matrix output format for B (the betas) only, usually "text" or "csv"
        # ---------------------------------------------------------------------------------------
        # OUTPUT: Matrix of regression parameters (the betas) and its size depend on icpt input value:
        #         OUTPUT SIZE:     OUTPUT CONTENTS:                HOW TO PREDICT Y FROM X AND B:
        # icpt=0: ncol(X)   x 1    Betas for X only                Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
        # icpt=1: ncol(X)+1 x 1    Betas for X and intercept       Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
        # icpt=2: ncol(X)+1 x 2    Col.1: betas for X & intercept  Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
        #                          Col.2: betas for shifted/rescaled X and intercept
18. LinReg Scripts: Outputs

        # In addition, some regression statistics are provided in CSV format, one comma-separated
        # name-value pair per each line, as follows:
        #
        # NAME                  MEANING
        # -------------------------------------------------------------------------------------
        # AVG_TOT_Y             Average of the response value Y
        # STDEV_TOT_Y           Standard Deviation of the response value Y
        # AVG_RES_Y             Average of the residual Y - pred(Y|X), i.e. residual bias
        # STDEV_RES_Y           Standard Deviation of the residual Y - pred(Y|X)
        # DISPERSION            GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
        # PLAIN_R2              Plain R^2 of residual with bias included vs. total average
        # ADJUSTED_R2           Adjusted R^2 of residual with bias included vs. total average
        # PLAIN_R2_NOBIAS       Plain R^2 of residual with bias subtracted vs. total average
        # ADJUSTED_R2_NOBIAS    Adjusted R^2 of residual with bias subtracted vs. total average
        # PLAIN_R2_VS_0      *  Plain R^2 of residual with bias included vs. zero constant
        # ADJUSTED_R2_VS_0   *  Adjusted R^2 of residual with bias included vs. zero constant
        # -------------------------------------------------------------------------------------
        # * The last two statistics are only printed if there is no intercept (icpt=0)
        #
        # The Log file, when requested, contains the following per-iteration variables in CSV
        # format, each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
        # initial values:
        #
        # NAME                  MEANING
        # -------------------------------------------------------------------------------------
        # CG_RESIDUAL_NORM      L2-norm of Conj.Grad.residual, which is A %*% beta - t(X) %*% y
        #                       where A = t(X) %*% X + diag (lambda), or a similar quantity
        # CG_RESIDUAL_RATIO     Ratio of current L2-norm of Conj.Grad.residual over the initial
        # -------------------------------------------------------------------------------------
19. Caveats
    • Overfitting: β reflects individual records in X, not the distribution
        – Typically, too few records (small n) or too many features (large m)
        – To detect, use cross-validation
        – To mitigate, select fewer features; regularization may help too
    • Outliers: Some records in X are highly abnormal
        – They badly violate the distribution, or have very large cell values
        – Check MIN and MAX of Y, the X-columns, $X_i \beta$, and $r_i^2 = (y_i - X_i \beta)^2 / \mathrm{Var}(\varepsilon)$
        – To mitigate, remove outliers, or change the distribution or link function
    • Interpolation vs. extrapolation
        – A model trained on one kind of data may not carry over to another kind of data; the past may not predict the future
        – Great research topic!
20. Generalized Linear Models
    • Linear Regression: $Y = X\beta^* + \varepsilon$
        – Each $y_i$ is $\mathrm{Normal}(\mu_i, \sigma^2)$ where mean $\mu_i = X_i \beta^*$
        – $\mathrm{Variance}(y_i) = \sigma^2$ = constant
    • Logistic Regression:
        – Each $y_i$ is $\mathrm{Bernoulli}(\mu_i)$ where mean $\mu_i = 1 / (1 + \exp(-X_i \beta^*))$
        – $\mathrm{Prob}[y_i = 1] = \mu_i$, $\mathrm{Prob}[y_i = 0] = 1 - \mu_i$; mean = probability of 1
        – $\mathrm{Variance}(y_i) = \mu_i (1 - \mu_i)$
    • Poisson Regression:
        – Each $y_i$ is $\mathrm{Poisson}(\mu_i)$ where mean $\mu_i = \exp(X_i \beta^*)$
        – $\mathrm{Prob}[y_i = k] = (\mu_i)^k \exp(-\mu_i) / k!$ for $k = 0, 1, 2, \dots$
        – $\mathrm{Variance}(y_i) = \mu_i$
    • Only in Linear Regression do we add an error $\varepsilon_i$ to the mean $\mu_i$
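The mean and variance relations for the logistic and Poisson cases can be checked numerically. This is an illustrative sketch with hypothetical function names; the Poisson sums are truncated at $k < 60$, which is assumed to be ample for $\mu = 2$.

```python
import math

def logistic_mean(eta):
    """Mean of a Bernoulli response for linear term eta = X_i beta."""
    return 1.0 / (1.0 + math.exp(-eta))

def poisson_mean(eta):
    """Mean of a Poisson response for linear term eta = X_i beta."""
    return math.exp(eta)

def poisson_pmf(k, mu):
    return mu ** k * math.exp(-mu) / math.factorial(k)

# Bernoulli: variance is mu * (1 - mu)
mu_b = logistic_mean(0.0)
var_b = mu_b * (1.0 - mu_b)

# Poisson: mean and variance both equal mu (checked on a truncated sum)
mu_p = poisson_mean(math.log(2.0))
ks = range(60)
total = sum(poisson_pmf(k, mu_p) for k in ks)
mean_p = sum(k * poisson_pmf(k, mu_p) for k in ks)
var_p = sum((k - mean_p) ** 2 * poisson_pmf(k, mu_p) for k in ks)
```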
21. Generalized Linear Models
    • GLM Regression:
        – Each $y_i$ has distribution $= \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
        – Canonical parameter $\theta_i$ represents the mean: $\mu_i = b'(\theta_i)$
        – Link function connects $\mu_i$ and $X_i \beta^*$: $X_i \beta^* = g(\mu_i)$, $\mu_i = g^{-1}(X_i \beta^*)$
        – $\mathrm{Variance}(y_i) = a \cdot b''(\theta_i)$
    • Example: Linear Regression as GLM
        – $C(\sigma) \cdot \exp\bigl(-(y_i - X_i \beta)^2 / 2\sigma^2\bigr) = \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
        – $\theta_i = \mu_i = X_i \beta$; $b(\theta_i) = (X_i \beta)^2 / 2$; $a = \sigma^2$ = variance
            • Link function = identity; $c(y_i, a) = -y_i^2 / 2\sigma^2 + \log C(\sigma)$
    • Example: Logistic Regression as GLM
        – $(\mu_i)^{y_i} (1 - \mu_i)^{1 - y_i} = \exp\{y_i \cdot \log(\mu_i) - y_i \cdot \log(1 - \mu_i) + \log(1 - \mu_i)\} = \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
        – $\theta_i = \log(\mu_i / (1 - \mu_i)) = X_i \beta$; $b(\theta_i) = -\log(1 - \mu_i) = \log(1 + \exp(\theta_i))$
            • Link function $= \log(\mu / (1 - \mu))$; Variance $= \mu (1 - \mu)$; $a = 1$
22. Generalized Linear Models
    • GLM Regression:
        – Each $y_i$ has distribution $= \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
        – Canonical parameter $\theta_i$ represents the mean: $\mu_i = b'(\theta_i)$
        – Link function connects $\mu_i$ and $X_i \beta^*$: $X_i \beta^* = g(\mu_i)$, $\mu_i = g^{-1}(X_i \beta^*)$
        – $\mathrm{Variance}(y_i) = a \cdot b''(\theta_i)$
    • Why $\theta_i$? What is $b(\theta_i)$?
        – $\theta_i$ makes formulas simpler, stands for $\mu_i$ (no big deal)
        – $b(\theta_i)$ defines what distribution it is: linear, logistic, Poisson, etc.
        – $b(\theta_i)$ connects mean with variance: $\mathrm{Var}(y_i) = a \cdot b''(\theta_i)$, $\mu_i = b'(\theta_i)$
    • What is the link function?
        – You choose it to link $\mu_i$ with your features $\beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_m x_{im}$
        – Additive effects: $\mu_i = X_i \beta$; multiplicative effects: $\mu_i = \exp(X_i \beta)$; Bayes law effects: $\mu_i = 1 / (1 + \exp(-X_i \beta))$; inverse: $\mu_i = 1 / (X_i \beta)$
        – $X_i \beta$ has range $(-\infty, +\infty)$, but $\mu_i$ may range in $[0, 1]$, $[0, +\infty)$, etc.
23. GLMs We Support
    • We specify a GLM by:
        – Mean to variance connection
        – Link function (mean to feature sum connection)
    • Mean-to-variance for common distributions:
        – $\mathrm{Var}(y_i) = a \cdot (\mu_i)^0 = \sigma^2$: Linear / Gaussian
        – $\mathrm{Var}(y_i) = a \cdot \mu_i (1 - \mu_i)$: Logistic / Binomial
        – $\mathrm{Var}(y_i) = a \cdot (\mu_i)^1$: Poisson
        – $\mathrm{Var}(y_i) = a \cdot (\mu_i)^2$: Gamma
        – $\mathrm{Var}(y_i) = a \cdot (\mu_i)^3$: Inverse Gaussian
    • We support two types: Power and Binomial
        – $\mathrm{Var}(y_i) = a \cdot (\mu_i)^\alpha$: Power, for any α
        – $\mathrm{Var}(y_i) = a \cdot \mu_i (1 - \mu_i)$: Binomial
24. GLMs We Support
    • We specify a GLM by:
        – Mean to variance connection
        – Link function (mean to feature sum connection)
    • Supported link functions
        – Power: $X_i \beta = (\mu_i)^s$ where $s = 0$ stands for $X_i \beta = \log(\mu_i)$
            • Examples: identity, inverse, log, square root
        – Link functions used in binomial / logistic regression:
            • Logit, Probit, Cloglog, Cauchit
            • Link the $X_i \beta$-range $(-\infty, +\infty)$ with the $\mu_i$-range $(0, 1)$
            • Differ in tail behavior
        – Canonical link function:
            • Makes $X_i \beta$ = the canonical parameter $\theta_i$, i.e. sets $\mu_i = b'(X_i \beta)$
            • Power link $X_i \beta = (\mu_i)^{1 - \alpha}$ if $\mathrm{Var} = a \cdot (\mu_i)^\alpha$; Logit link for binomial
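The power link family and its canonical choice can be sketched as follows. The function names are hypothetical; the convention that $s = 0$ means the log link and the canonical exponent $1 - \alpha$ are taken from the slide, and the inverse assumes a positive linear term.

```python
import math

def power_link(mu, s):
    """Link g(mu) = mu^s, with s = 0 standing for log(mu)."""
    return math.log(mu) if s == 0 else mu ** s

def power_link_inverse(eta, s):
    """Inverse link: recover mu from eta = g(mu); assumes eta > 0 when s != 0."""
    return math.exp(eta) if s == 0 else eta ** (1.0 / s)

def canonical_link_power(alpha):
    """Canonical link exponent for the power variance family Var = a * mu^alpha."""
    return 1.0 - alpha

mu = 2.5
round_trips = {s: power_link_inverse(power_link(mu, s), s) for s in (-2.0, -1.0, 0, 0.5, 1.0)}

# Gaussian (alpha = 0) -> identity, Poisson (1) -> log, Gamma (2) -> reciprocal, Inverse Gaussian (3) -> 1/mu^2
canonical = [canonical_link_power(a) for a in (0.0, 1.0, 2.0, 3.0)]
```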
25. GLM Script Inputs

        # NAME  TYPE    DEFAULT   MEANING
        # ---------------------------------------------------------------------------------------
        # X     String  ---       Location to read the matrix X of feature vectors
        # Y     String  ---       Location to read response matrix Y with either 1 or 2 columns:
        #                         if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
        # B     String  ---       Location to store estimated regression parameters (the betas)
        # fmt   String  "text"    The betas matrix output format, such as "text" or "csv"
        # O     String  " "       Location to write the printed statistics; by default is standard output
        # Log   String  " "       Location to write per-iteration variables for log/debugging purposes
        # dfam  Int     1         Distribution family code: 1 = Power, 2 = Binomial
        # vpow  Double  0.0       Power for Variance defined as (mean)^power (ignored if dfam != 1):
        #                         0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
        # link  Int     0         Link function code: 0 = canonical (depends on distribution),
        #                         1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
        # lpow  Double  1.0       Power for Link function defined as (mean)^power (ignored if link != 1):
        #                         -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
        # yneg  Double  0.0       Response value for Bernoulli "No" label, usually 0.0 or -1.0
        # icpt  Int     0         Intercept presence, X columns shifting and rescaling:
        #                         0 = no intercept, no shifting, no rescaling;
        #                         1 = add intercept, but neither shift nor rescale X;
        #                         2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
        # reg   Double  0.0       Regularization parameter (lambda) for L2 regularization
        # tol   Double  0.000001  Tolerance (epsilon)
        # disp  Double  0.0       (Over-)dispersion value, or 0.0 to estimate it from data
        # moi   Int     200       Maximum number of outer (Newton / Fisher Scoring) iterations
        # mii   Int     0         Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
        # ---------------------------------------------------------------------------------------
        # OUTPUT: Matrix beta, whose size depends on icpt:
        #     icpt=0: ncol(X) x 1;  icpt=1: (ncol(X) + 1) x 1;  icpt=2: (ncol(X) + 1) x 2
26. GLM Script Outputs

        # In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
        # pair per each line, as follows:
        # -------------------------------------------------------------------------------------------
        # TERMINATION_CODE   A positive integer indicating success/failure as follows:
        #                    1 = Converged successfully; 2 = Maximum number of iterations reached;
        #                    3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
        # BETA_MIN           Smallest beta value (regression coefficient), excluding the intercept
        # BETA_MIN_INDEX     Column index for the smallest beta value
        # BETA_MAX           Largest beta value (regression coefficient), excluding the intercept
        # BETA_MAX_INDEX     Column index for the largest beta value
        # INTERCEPT          Intercept value, or NaN if there is no intercept (if icpt=0)
        # DISPERSION         Dispersion used to scale deviance, provided as "disp" input parameter
        #                    or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
        # DISPERSION_EST     Dispersion estimated from the dataset
        # DEVIANCE_UNSCALED  Deviance from the saturated model, assuming dispersion == 1.0
        # DEVIANCE_SCALED    Deviance from the saturated model, scaled by the DISPERSION value
        # -------------------------------------------------------------------------------------------
        #
        # The Log file, when requested, contains the following per-iteration variables in CSV format,
        # each line containing triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
        # -------------------------------------------------------------------------------------------
        # NUM_CG_ITERS       Number of inner (Conj.Gradient) iterations in this outer iteration
        # IS_TRUST_REACHED   1 = trust region boundary was reached, 0 = otherwise
        # POINT_STEP_NORM    L2-norm of iteration step from old point (i.e. "beta") to new point
        # OBJECTIVE          The loss function we minimize (i.e. negative partial log-likelihood)
        # OBJ_DROP_REAL      Reduction in the objective during this iteration, actual value
        # OBJ_DROP_PRED      Reduction in the objective predicted by a quadratic approximation
        # OBJ_DROP_RATIO     Actual-to-predicted reduction ratio, used to update the trust region
        # GRADIENT_NORM      L2-norm of the loss function gradient (NOTE: sometimes omitted)
        # LINEAR_TERM_MIN    The minimum value of X %*% beta, used to check for overflows
        # LINEAR_TERM_MAX    The maximum value of X %*% beta, used to check for overflows
        # IS_POINT_UPDATED   1 = new point accepted; 0 = new point rejected, old point restored
        # TRUST_DELTA        Updated trust region size, the "delta"
        # -------------------------------------------------------------------------------------------
27. GLM Likelihood Maximization
    • 1 record: $\ell(y_i \mid \theta_i, a) = \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
    • $\log \ell(Y \mid \Theta, a) = \tfrac{1}{a} \sum_{i \le n} (y_i \cdot \theta_i - b(\theta_i)) + \text{const}$, where the constant does not depend on Θ
    • $f(\beta; X, Y) = -\sum_{i \le n} (y_i \cdot \theta_i - b(\theta_i)) + \tfrac{\lambda}{2} \cdot \beta^T \beta \to \min$
        – Here $\theta_i$ is a function of β: $\theta_i = (b')^{-1}(g^{-1}(X_i \beta))$
        – Add regularization with $\lambda/2$ to agree with least squares
        – If X has an intercept, do NOT regularize its β-value
    • Non-quadratic; how to optimize?
        – Gradient descent: fastest when far from the optimum
        – Newton method: fastest when close to the optimum
    • Trust Region Conjugate Gradient
        – Strikes a good balance between the above two
28. GLM Likelihood Maximization
    • $f(\beta; X, Y) = -\sum_{i \le n} (y_i \cdot \theta_i - b(\theta_i)) + \tfrac{\lambda}{2} \cdot \beta^T \beta \to \min$
    • Outer iteration: from β to $\beta_{new} = \beta + z$
        – $\Delta f(z; \beta) := f(\beta + z; X, Y) - f(\beta; X, Y)$
    • Use "Fisher Scoring" to approximate the Hessian and $\Delta f(z; \beta)$
        – $\Delta f(z; \beta) \approx \tfrac{1}{2}\, z^T A z + G^T z$, where:
        – $A = X^T \mathrm{diag}(w)\, X + \lambda I$ and $G = -X^T u + \lambda \cdot \beta$
        – Vectors $u$, $w$ depend on β via the mean-to-variance and link functions
        – $FI = X^T \mathrm{diag}(w)\, X$ is the "expected" Hessian
    • Trust Region: area $\lVert z \rVert_2 \le \delta$ where we trust the approximation $\Delta f(z; \beta) \approx \tfrac{1}{2}\, z^T A z + G^T z$
        – $\lVert z \rVert_2 \le \delta$ too small → Gradient Descent step (1 inner iteration)
        – $\lVert z \rVert_2 \le \delta$ mid-size → Cut-off Conjugate Gradient step (2 or more)
        – $\lVert z \rVert_2 \le \delta$ too wide → Full Conjugate Gradient step
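To make the quadratic model concrete, here is a one-feature sketch of Fisher scoring for logistic regression with the canonical logit link and labels in {0, 1}. In that special case $u_i = y_i - \mu_i$ and $w_i = \mu_i (1 - \mu_i)$, which is an assumption of this example rather than something the slide states. The sketch takes the full step $z = -G / A$ and omits the trust region entirely; `fisher_scoring_logistic` is a hypothetical name.

```python
import math

def fisher_scoring_logistic(x, y, lam=0.0, iterations=10):
    """Scalar-beta Fisher scoring: minimize 0.5 * A * z^2 + G * z at each outer step."""
    beta = 0.0
    for _ in range(iterations):
        mu = [1.0 / (1.0 + math.exp(-xi * beta)) for xi in x]
        u = [yi - mi for yi, mi in zip(y, mu)]              # u = y - mu (canonical link)
        w = [mi * (1.0 - mi) for mi in mu]                  # w = V(mu) (canonical link)
        G = -sum(xi * ui for xi, ui in zip(x, u)) + lam * beta
        A = sum(wi * xi * xi for xi, wi in zip(x, w)) + lam
        beta += -G / A                                      # minimizer of the quadratic model
    return beta

# Intercept-only model: 3 positives out of 4, so the MLE is log(3 / 1)
x = [1.0, 1.0, 1.0, 1.0]
y = [1.0, 1.0, 1.0, 0.0]
beta_hat = fisher_scoring_logistic(x, y)
```

A trust region becomes important when the full step overshoots, for example when the classes are nearly separable and $A$ is close to zero.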
29. Trust Region Conj. Gradient
    • Code snippet for Logistic Regression

            # Gradient and objective at beta = 0, where every p = 0.5
            g = - 0.5 * t(X) %*% y;
            f_val = - N * log (0.5);
            delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
            exit_g2 = sum (g ^ 2) * tolerance ^ 2;
            while (sum (g ^ 2) > exit_g2 & i < max_i) {
                i = i + 1;
                r = g;
                r2 = sum (r ^ 2);
                exit_r2 = 0.01 * r2;
                d = - r;
                z = zeros_D;
                j = 0;
                trust_bound_reached = FALSE;
                # Inner loop: conjugate gradient on the quadratic model, cut off at the trust region boundary
                while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j) {
                    j = j + 1;
                    Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
                    c = r2 / sum (d * Hd);
                    [c, trust_bound_reached] = ensure_quadratic (c, sum(d^2), 2 * sum(z*d), sum(z^2) - delta^2);
                    z = z + c * d;
                    r = r + c * Hd;
                    r2_new = sum (r ^ 2);
                    d = - r + (r2_new / r2) * d;
                    r2 = r2_new;
                }
                p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
                f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
                delta = update_trust_region (delta, sqrt(sum(z^2)), f_chg, sum(z*g), 0.5 * sum(z*(r + g)));
                # Accept the step only if the objective actually decreased
                if (f_chg < 0) {
                    beta = beta + z;
                    f_val = f_val + f_chg;
                    w = p * (1 - p);
                    g = - t(X) %*% ((1 - p) * y) + lambda * beta;
                }
            }

            # If a * x^2 + b * x + c > 0, replace x with the positive root of a * x^2 + b * x + c = 0
            ensure_quadratic = function (double x, a, b, c) return (double x_new, boolean test) {
                test = (a * x^2 + b * x + c > 0);
                if (test) {
                    rad = sqrt (b ^ 2 - 4 * a * c);
                    if (b >= 0) {
                        x_new = - (2 * c) / (b + rad);
                    } else {
                        x_new = - (b - rad) / (2 * a);
                    }
                } else {
                    x_new = x;
                }
            }
30. Trust Region Conj. Gradient
    • Trust region update in the Logistic Regression snippet

            update_trust_region = function (double delta, double z_distance, double f_chg_exact,
                                            double f_chg_linear_approx, double f_chg_quadratic_approx)
                return (double delta)
            {
                sigma1 = 0.25;
                sigma2 = 0.5;
                sigma3 = 4.0;
                if (f_chg_exact <= f_chg_linear_approx) {
                    alpha = sigma3;
                } else {
                    alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
                }
                rho = f_chg_exact / f_chg_quadratic_approx;
                if (rho < 0.0001) {
                    delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
                } else {
                    if (rho < 0.25) {
                        delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
                    } else {
                        if (rho < 0.75) {
                            delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
                        } else {
                            delta = max (delta, min (alpha * z_distance, sigma3 * delta));
                        }
                    }
                }
            }
31. GLM: Other Statistics
    • REMINDER:
        – Each $y_i$ has distribution $= \exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
        – $\mathrm{Variance}(y_i) = a \cdot b''(\theta_i) = a \cdot V(\mu_i)$
    • Variance of Y given X
        – Estimating the β gives $V(\mu_i) = V(g^{-1}(X_i \beta))$
        – Constant "a" is called dispersion, the analogue of $\sigma^2$
        – Estimator: $a \approx \tfrac{1}{n - m} \sum_{i \le n} (y_i - \mu_i)^2 / V(\mu_i)$
    • Variance of parameters β
        – We use MLE, hence the Cramér-Rao formula applies (for large n)
        – Fisher Information: $FI = (1/a) \cdot X^T \mathrm{diag}(w)\, X$, $w_i = \bigl(V(\mu_i) \cdot g'(\mu_i)^2\bigr)^{-1}$
        – Estimator: $\mathrm{Cov}\, \beta \approx a \cdot (X^T \mathrm{diag}(w)\, X)^{-1}$, $\mathrm{Var}\, \beta_j = (\mathrm{Cov}\, \beta)_{jj}$
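The dispersion estimator is a one-liner once the fitted means are available. In this sketch `estimate_dispersion` is a hypothetical helper, `m` is the number of fitted parameters, and the responses and fitted means are made-up numbers for a Poisson-type variance $V(\mu) = \mu$.

```python
def estimate_dispersion(y, mu, m, variance_fn):
    """Estimate a as 1/(n - m) * sum of (y_i - mu_i)^2 / V(mu_i)."""
    n = len(y)
    return sum((yi - mi) ** 2 / variance_fn(mi) for yi, mi in zip(y, mu)) / (n - m)

y = [0.0, 2.0, 1.0, 5.0]
mu = [1.0, 1.0, 2.0, 4.0]
a_hat = estimate_dispersion(y, mu, m=1, variance_fn=lambda v: v)
```

For a true Poisson model the estimate should be near 1; values well above 1 suggest over-dispersion.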
32. GLM: Deviance
    • Let X have m features, of which k may have no effect on Y
        – Will "no effect" result in $\beta_j \approx 0$? (Unlikely.)
        – Estimate $\beta_j$ and $\mathrm{Var}\, \beta_j$, then test $\beta_j / (\mathrm{Var}\, \beta_j)^{1/2}$ against $N(0, 1)$?
            • Student's t-test is better
    • Likelihood Ratio Test:
      $$D = 2 \cdot \log \left[ \frac{\max_\beta L_{GLM}(Y \mid X, a;\ \beta_1, \dots, \beta_k, \beta_{k+1}, \dots, \beta_m)}{\max_\beta L_{GLM}(Y \mid X, a;\ 0, \dots, 0, \beta_{k+1}, \dots, \beta_m)} \right] > 0$$
    • Null Hypothesis: Y given X follows the GLM with $\beta_1 = \dots = \beta_k = 0$
        – If NH is true, D is asymptotically distributed as $\chi^2$ with k degrees of freedom
        – If NH is false, $D \to +\infty$ as $n \to +\infty$
    • P-value % $= \mathrm{Prob}[\chi^2_k > D] \cdot 100\%$
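The p-value $\mathrm{Prob}[\chi^2_k > D]$ can be computed with the standard library alone. The sketch below uses the closed forms for $k = 1$ and $k = 2$ together with the recurrence $Q(a + 1, x) = Q(a, x) + x^a e^{-x} / \Gamma(a + 1)$ for the regularized upper incomplete gamma function; `chi2_sf` is a name chosen for this example, and a statistics library would normally be used instead.

```python
import math

def chi2_sf(d, k):
    """Prob[chi-square with k degrees of freedom > d], for a positive integer k and d >= 0."""
    x = d / 2.0
    if k % 2 == 0:
        a, q = 1.0, math.exp(-x)                  # k = 2: survival function is exp(-d/2)
    else:
        a, q = 0.5, math.erfc(math.sqrt(x))       # k = 1: survival function is erfc(sqrt(d/2))
    while 2 * a < k:                              # step the degrees of freedom up by 2 at a time
        q += x ** a * math.exp(-x) / math.gamma(a + 1.0)
        a += 1.0
    return q

# Familiar 5% critical values of the chi-square distribution
p1 = chi2_sf(3.8415, 1)
p2 = chi2_sf(-2.0 * math.log(0.05), 2)
p3 = chi2_sf(7.8147, 3)
```

A small p-value is evidence against the null hypothesis that the k tested features have no effect.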
33. GLM: Deviance
    • To test many nested models (feature subsets) we need their maximum likelihoods to compute D
        – PROBLEM: the term "$c(y_i, a)$" in the GLM's $\exp\{(y_i \cdot \theta_i - b(\theta_i))/a + c(y_i, a)\}$
    • Instead, compute the deviance:
      $$D = 2 \cdot \log \left[ \frac{\max_\Theta L_{GLM}(Y \mid \Theta, a)\ \text{(saturated model)}}{\max_\beta L_{GLM}(Y \mid X, a;\ \beta_1, \dots, \beta_k, \beta_{k+1}, \dots, \beta_m)} \right] > 0$$
    • The "saturated model" has no X, no β, but picks the best $\theta_i$ for each individual $y_i$ (not realistic at all, just a convention)
        – The term "$c(y_i, a)$" is the same in both models!
        – But "a" has to be fixed, e.g. to 1
    • Deviance itself is used for goodness of fit tests, too
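As a concrete instance, the deviance of a Poisson model with $a = 1$ can be computed from the two log-likelihoods with the $c(y_i, a)$ term dropped, since it cancels. The function name and the data below are illustrative, and the convention $0 \cdot \log 0 = 0$ is assumed.

```python
import math

def poisson_deviance(y, mu):
    """Deviance = 2 * (saturated log-likelihood - fitted log-likelihood), dispersion fixed to 1."""
    d = 0.0
    for yi, mi in zip(y, mu):
        saturated = yi * math.log(yi) - yi if yi > 0 else 0.0  # best theta_i per record: mu_i = y_i
        fitted = yi * math.log(mi) - mi
        d += 2.0 * (saturated - fitted)
    return d

y = [1.0, 2.0, 3.0]
d_perfect = poisson_deviance(y, y)                 # fitted means equal the responses
d_constant = poisson_deviance(y, [2.0, 2.0, 2.0])  # constant-mean model
```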
34. Survival Analysis
    • Given
        – Survival data from individuals as (time, event)
        – Categorical/continuous features for each individual
    • Estimate
        – Probability of survival to a future time
        – Rate of hazard at a given time
    • [Figure: example timelines for patients 1 to 9 plotted against time; † marks death from a specific cancer, ? marks lost to follow-up]
35. Cox Regression
    • Semi-parametric model, "robust"
    • Most commonly used
    • Handles categorical and continuous data
    • Handles (right/left/interval) censored data
    • [Figure: hazard formula annotated with the labels "baseline hazard", "covariates", and "coefficients"]
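36. Event Hazard Rate
    • Symptom events E follow a Poisson process
    • [Figure: timeline with events E1, E2, E3, E4 and Death, with the hazard function drawn over time]
    • Hazard function = Poisson rate:
      $$h(t; \mathrm{state}) = \lim_{\Delta t \to 0} \frac{\mathrm{Prob}\bigl[E \in [t, t + \Delta t) \mid \mathrm{state}\bigr]}{\Delta t}$$
    • Given state and hazard, we could compute the probability of the observed event count:
      $$\mathrm{Prob}[K \text{ events in } t_1 \le t \le t_2] = \frac{H^K e^{-H}}{K!}, \qquad H = \int_{t_1}^{t_2} h(t; \mathrm{state}(t))\, dt$$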
37. Cox Proportional Hazards
    • Assume that exactly 1 patient gets event E at time t
    • The probability that it is Patient #i is the hazard ratio:
      $$\mathrm{Prob}[\#i \text{ gets } E] = \frac{h(t; s_i)}{\sum_{j=1}^{n} h(t; s_j)}$$
    • Cox assumption:
      $$h(t; \mathrm{state}) = h_0(t) \cdot \Lambda(\mathrm{state}) = h_0(t) \cdot \exp(\lambda^T s)$$
    • Time confounder cancels out!
    • [Figure: Patients #1 to #n with states $s_1, s_2, \dots, s_n$ at time t, where $s_i = \mathrm{state}_i$]
38. Cox "Partial" Likelihood
    • The Cox "partial" likelihood for the dataset is a product over all events E:
      $$L_{Cox}(\lambda) = \mathrm{Prob}[\text{all } E] = \prod_{t:\, E} \frac{h(t; s_{who(t)}(t))}{\sum_{j=1}^{n} h(t; s_j(t))} = \prod_{t:\, E} \frac{\exp(\lambda^T s_{who(t)}(t))}{\sum_{j=1}^{n} \exp(\lambda^T s_j(t))}$$
    • [Figure: Patients #1 to #n with event times marked on their timelines]
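The partial likelihood is easiest to evaluate on the log scale. In this sketch each event is represented, purely for illustration, as a pair `(who, covariates)`, where `covariates[j]` is the state vector $s_j(t)$ of each patient included in the sum at that event time and `who` indexes the patient who got the event. The function name and the toy data are hypothetical.

```python
import math

def cox_partial_log_likelihood(lam, events):
    """Sum over events of lam^T s_who - log(sum_j exp(lam^T s_j))."""
    total = 0.0
    for who, covariates in events:
        scores = [sum(l * s for l, s in zip(lam, sj)) for sj in covariates]
        total += scores[who] - math.log(sum(math.exp(v) for v in scores))
    return total

# Two events: three patients in the first sum, two in the second
events = [
    (0, [[1.0], [0.0], [0.0]]),
    (1, [[0.0], [0.0]]),
]
ll_null = cox_partial_log_likelihood([0.0], events)             # all patients equally likely
ll_fit = cox_partial_log_likelihood([math.log(2.0)], events)    # covariate doubles the hazard
```

Because the baseline hazard $h_0(t)$ cancels in every ratio, it never appears in the code; the slides state that the fit minimizes the negative of this quantity.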
39. Cox Regression
    • Semi-parametric model, "robust"
    • Most commonly used
    • Handles categorical and continuous data
    • Handles (right/left/interval) censored data
    • [Figure: hazard formula annotated with the labels "baseline hazard", "covariates", and "coefficients"]
    • Cox regression in DML
        – Fitting parameters via negative partial log-likelihood
        – Method: trust region Newton with conjugate gradient
        – Inverting the Hessian using block Cholesky for computing the standard error of the betas
        – Similar features as coxph() in R, e.g., stratification, frequency weights, offsets, goodness of fit testing, recurrent event analysis
40. BACK-UP
41. Kaplan-Meier Estimator
42. Kaplan-Meier Estimator
43. Confidence Intervals
    • Definition of Confidence Interval; p-value
    • Likelihood ratio test
    • How to use it for confidence interval
    • Degrees of freedom