Regression using Apache SystemML by Alexandre V Evfimievski

This deck presents the regression algorithms supported in Apache SystemML: Linear Regression (least squares, solved directly or by conjugate gradient) and Generalized Linear Models.

Regression in SystemML
Alexandre Evfimievski

Linear Regression
• INPUT: Records (x1, y1), (x2, y2), …, (xn, yn)
– Each xi is m-dimensional: xi1, xi2, …, xim
– Each yi is 1-dimensional
• Want to approximate yi as a linear combination of the xi-entries
– yi ≈ β1xi1 + β2xi2 + … + βmxim
– Case m = 1: yi ≈ β1xi1 (Note: x = 0 maps to y = 0)
• Intercept: a "free parameter" for the default value of yi
– yi ≈ β1xi1 + β2xi2 + … + βmxim + βm+1
– Case m = 1: yi ≈ β1xi1 + β2
• Matrix notation: Y ≈ Xβ, or Y ≈ (X | 1)β with intercept
– X is n × m, Y is n × 1, β is m × 1 or (m+1) × 1

Linear Regression: Least Squares
• How to aggregate the errors yi – (β1xi1 + β2xi2 + … + βmxim)?
– What's worse: many small errors, or a few big errors?
• Sum of squares: ∑i≤n (yi – (β1xi1 + β2xi2 + … + βmxim))2 → min
– A few big errors are much worse! We square them!
• Matrix notation: (Y – Xβ)T (Y – Xβ) → min
• Good news: easy to solve and find the β's
• Bad news: too sensitive to outliers!

Linear Regression: Direct Solve
• (Y – Xβ)T (Y – Xβ) → min
• YT Y – YT (Xβ) – (Xβ)T Y + (Xβ)T (Xβ) → min
• ½ βT (XTX) β – βT (XTY) → min
• Take the gradient and set it to 0: (XTX) β – (XTY) = 0
• Linear equation: (XTX) β = XTY;  Solution: β = (XTX)–1 (XTY)

A = t(X) %*% X;
b = t(X) %*% y;
. . . . . .
beta_unscaled = solve (A, b);

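For orientation, here is a self-contained DML sketch of the direct-solve path; it is not the shipped LinReg script, and the input file names and the small regularization constant are assumptions for illustration only:

X = read ("X.mtx");                 # n x m feature matrix (assumed input path)
y = read ("y.mtx");                 # n x 1 response vector (assumed input path)
lambda = 0.000001;                  # small L2 regularization, as in the LinReg scripts

A = t(X) %*% X;                                               # m x m matrix X^T X
b = t(X) %*% y;                                               # m x 1 vector X^T Y
A = A + diag (matrix (lambda, rows = ncol (X), cols = 1));    # add lambda on the diagonal
beta = solve (A, b);                                          # solve (X^T X + lambda*I) beta = X^T Y
write (beta, "beta.mtx");
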
Computation of XTX
• Input (n × m)-matrix X is often huge and sparse
– Rows X[i, ] make up the n records, often n >> 10^6
– Columns X[, j] are the features
• Matrix XTX is (m × m) and dense
– Cells: (XTX)[j1, j2] = ∑i≤n X[i, j1] * X[i, j2]
– Part of the covariance between features #j1 and #j2 across all records
– m could be small or large
• If m ≤ 1000, XTX is small and "direct solve" is efficient…
– … as long as XTX is computed the right way!
– … and as long as XTX is invertible (no linearly dependent features)

Computation of XTX
• Naïve computation:
a) Read X into memory
b) Copy it and rearrange cells into the transpose
c) Multiply two huge matrices, XT and X
• There is a better way: XTX = ∑i≤n X[i, ]T X[i, ] (sum of outer products)
– For all i = 1, …, n in parallel:
a) Read one row X[i, ]
b) Compute the (m × m)-matrix Mi[j1, j2] = X[i, j1] * X[i, j2]
c) Aggregate: M = M + Mi
• Extends to (XTX)v and XT diag(w) X, used in other scripts:
– (XTX)v = ∑i≤n (∑j≤m X[i, j] v[j]) * X[i, ]T
– XT diag(w) X = ∑i≤n wi * X[i, ]T X[i, ]

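To make the row-wise formulation concrete, a purely illustrative DML sketch that accumulates XTX as a sum of outer products; in practice one simply writes t(X) %*% X and lets the SystemML optimizer choose such an execution plan (X is assumed to be already defined):

n = nrow (X);
m = ncol (X);
M = matrix (0, rows = m, cols = m);
for (i in 1:n) {
    xi = X[i, ];                  # 1 x m row vector
    M = M + t(xi) %*% xi;         # add the outer product X[i, ]^T X[i, ]
}
# M now equals t(X) %*% X
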
Conjugate Gradient
• What if XTX is too large, m >> 1000?
– Dense XTX may take far more memory than sparse X
• The full XTX is not needed to solve (XTX) β = XTY
– Use an iterative method
– Only evaluate (XTX)v for certain vectors v
• Example: Gradient Descent for f(β) = ½ βT (XTX) β – βT (XTY)
– Start with any β = β0
– Take the gradient: r = df(β) = (XTX) β – (XTY) (also the residual)
– Find the number a that minimizes f(β + a·r): a = – (rT r) / (rT XTX r)
– Update: βnew ← β + a·r
• But the gradient is too local, and "forgetful"

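A minimal DML sketch of the gradient-descent iteration described above; X, y, and max_iter are assumed to be defined, and the fixed iteration count is an illustration rather than the scripts' actual stopping rule:

beta = matrix (0, rows = ncol (X), cols = 1);
for (k in 1:max_iter) {
    r = t(X) %*% (X %*% beta) - t(X) %*% y;   # gradient = residual (X^T X) beta - X^T Y
    q = t(X) %*% (X %*% r);                   # (X^T X) r, needed for the exact step size
    a = - sum (r ^ 2) / sum (r * q);          # a = - r^T r / (r^T (X^T X) r) minimizes f(beta + a*r)
    beta = beta + a * r;                      # update: beta_new <- beta + a*r
}
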
Conjugate Gradient
• PROBLEM: The gradient takes a very similar direction many times
• Enforce orthogonality to prior directions?
– Take the gradient: r = (XTX) β – (XTY)
– Subtract prior directions: p(k) = r – λ1 p(1) – … – λk-1 p(k-1)
• Pick λi to ensure (p(k) · p(i)) = 0 ???
– Find the number a(k) that minimizes f(β + a(k)·p(k)), etc.
• STILL, PROBLEMS:
– Value a(k) does NOT minimize f(a(1)·p(1) + … + a(k)·p(k) + … + a(m)·p(m))
– Keep all prior directions p(1), p(2), …, p(k)? That's a lot!
• SOLUTION: Enforce conjugacy
– Conjugate vectors: uT (XTX) v = 0, instead of uT v = 0
• Matrix XTX acts as the "metric" in a distorted space
– This does minimize f(a(1)·p(1) + … + a(k)·p(k) + … + a(m)·p(m))
• And only p(k-1) and r(k) are needed to compute p(k)

Conjugate Gradient
• Algorithm, step by step:

i = 0;  beta = matrix (0, ...);               # Initially: beta = 0
r = - t(X) %*% y;                             # Residual & gradient r = (X^T X) beta - (X^T Y)
p = - r;                                      # Direction for beta: negative gradient
norm_r2 = sum (r ^ 2);                        # Norm of residual error = r^T r
norm_r2_target = norm_r2 * tolerance ^ 2;     # Desired norm of residual error
while (i < mi & norm_r2 > norm_r2_target) {   # WE HAVE: p is the next direction for beta
    q = t(X) %*% (X %*% p) + lambda * p;      # q = (X^T X) p
    a = norm_r2 / sum (p * q);                # a = r^T r / p^T (X^T X) p  minimizes f(beta + a*p)
    beta = beta + a * p;                      # Update: beta_new <- beta + a*p
    r = r + a * q;                            # r_new <- (X^T X)(beta + a*p) - (X^T Y) = r + a*(X^T X) p
    old_norm_r2 = norm_r2;
    norm_r2 = sum (r ^ 2);                    # Update the norm of residual error = r^T r
    p = -r + (norm_r2 / old_norm_r2) * p;     # Update direction: (1) take the negative gradient;
                                              # (2) enforce conjugacy with the previous direction
    i = i + 1;                                # Conjugacy to all older directions is automatic!
}

Degeneracy and Regularization
• PROBLEM: What if X has linearly dependent columns?
– Cause: recoding categorical features, adding composite features
– Then XTX is not a "metric": there exists ǁpǁ > 0 such that pT (XTX) p = 0
– In the CG step a = rT r / pT (XTX) p: division by zero!
• In fact, Least Squares then has ∞ solutions
– Most of them have HUGE β-values
• Regularization: penalize large β-values
– L2-regularization: (Y – Xβ)T (Y – Xβ) + λ·βT β → min
– Replace XTX with XTX + λI
– Pick λ << diag(XTX), refine by cross-validation
– Do NOT regularize the intercept
• CG: q = t(X) %*% (X %*% p) + lambda * p;

Shifting and Scaling X
• PROBLEM: Features have vastly different ranges:
– Examples: [0, 1]; [2010, 2015]; [$0.01, $1 Billion]
• Each βi in Y ≈ Xβ then has a different size & accuracy
• The regularization term λ·βT β is also range-dependent
• SOLUTION: Scale & shift the features to mean = 0, variance = 1
– Needs an intercept: Y ≈ (X | 1)β
– Equivalently: (Xnew | 1) = (X | 1) %*% SST, the "Shift-Scale Transform"
• BUT: sparse X becomes dense Xnew …
• SOLUTION: (Xnew | 1) %*% M = (X | 1) %*% (SST %*% M)
– Extends to XTX and other X-products
– Further optimization: SST has a special shape

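A simplified DML sketch of how shift and scale vectors could be built; it does not reproduce the scripts' exact scale_X / shift_X construction (which also carries an extra entry for the intercept column), and the names are reused here only for readability:

col_mean = colMeans (X);                                       # 1 x m column means
col_sd   = sqrt (colSums (X ^ 2) / nrow (X) - col_mean ^ 2);   # population std. deviations
scale_X  = t (1.0 / col_sd);                                   # m x 1: rescale each column
shift_X  = t (- col_mean / col_sd);                            # m x 1: shift each column
# Column j of the standardized matrix is  X[, j] * scale_X[j, 1] + shift_X[j, 1]
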
Shifting and Scaling X
• Linear Regression Direct Solve code snippet example:

A = t(X) %*% X;
b = t(X) %*% y;
if (intercept_status == 2) {
    A = t(diag (scale_X) %*% A + shift_X %*% A [m_ext, ]);
    A =   diag (scale_X) %*% A + shift_X %*% A [m_ext, ];
    b =   diag (scale_X) %*% b + shift_X %*% b [m_ext, ];
}
A = A + diag (lambda);
beta_unscaled = solve (A, b);
if (intercept_status == 2) {
    beta = scale_X * beta_unscaled;
    beta [m_ext, ] = beta [m_ext, ] + t(shift_X) %*% beta_unscaled;
} else {
    beta = beta_unscaled;
}

Regression in Statistics
• Model: Y = Xβ* + ε where ε is a random vector
– There exists a "true" β*
– Each yi is Gaussian with mean μi = Xi β* and variance σ2
• Likelihood maximization to estimate β*
– Likelihood: ℓ(Y | X, β, σ) = ∏i≤n C(σ)·exp(– (yi – Xi β)2 / 2σ2)
– Log ℓ(Y | X, β, σ) = n·c(σ) – ∑i≤n (yi – Xi β)2 / 2σ2
– Maximum likelihood over β = Least Squares
• Why do we need the statistical view?
– Confidence intervals for parameters
– Goodness-of-fit tests
– Generalizations: replace the Gaussian with another distribution

Maximum Likelihood Estimator
• In each (xi, yi) let yi have distribution ℓ(yi | xi, β, φ)
– Records are mutually independent for i = 1, …, n
• An estimator for β is a function f(X, Y)
– Y is random → f(X, Y) is random
– Unbiased estimator: for all β, mean E f(X, Y) = β
• Maximum likelihood estimator
– MLE(X, Y) = argmaxβ ∏i≤n ℓ(yi | xi, β, φ)
– Asymptotically unbiased: E MLE(X, Y) → β as n → ∞
• Cramér-Rao Bound
– For unbiased estimators, Var f(X, Y) ≥ FI(X, β, φ)–1
– Fisher information: FI(X, β, φ) = – EY Hessianβ log ℓ(Y | X, β, φ)
– For the MLE: Var(MLE(X, Y)) → FI(X, β, φ)–1 as n → ∞

Variance of M.L.E.
• The Cramér-Rao Bound is a simple way to estimate the variance of the predicted parameters (for large n):
1. Maximize log ℓ(Y | X, β, φ) to estimate β
2. Compute the Hessian (2nd derivatives) of log ℓ(Y | X, β, φ)
3. Compute the "expected" Hessian: FI = – EY Hessian
4. Invert FI as a matrix: get FI–1
5. Use FI–1 as the approximate covariance matrix for the estimated β
• For linear regression:
– Log ℓ(Y | X, β, σ) = n·c(σ) – ∑i≤n (yi – Xi β)2 / 2σ2
– Hessian = –(1/σ2)·XTX;  FI = (1/σ2)·XTX
– Cov β ≈ σ2·(XTX)–1;  Var βj ≈ σ2·diag((XTX)–1)j

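A hedged DML sketch of the linear-regression case above; the variable names are assumptions, and inv() is used for clarity even though solving a linear system is usually preferable:

sigma2      = sum ((y - X %*% beta) ^ 2) / (nrow (X) - ncol (X) - 1);   # residual variance estimate
cov_beta    = sigma2 * inv (t(X) %*% X);                                # Cov(beta) ~ sigma^2 (X^T X)^{-1}
stderr_beta = sqrt (diag (cov_beta));                                   # std. errors = sqrt of the diagonal
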
Variance of Y given X
• MLE for the variance of Y = 1/n · ∑i≤n (yi – yavg)2
– To make it unbiased, replace 1/n with 1/(n – 1)
• The variance of ε in Y = Xβ* + ε is the residual variance
– Estimator for Var(ε) = 1/(n – m – 1) · ∑i≤n (yi – Xi β)2
• Good regression must have: Var(ε) << Var(Y)
– "Explained" variance = Var(Y) – Var(ε)
• R-squared: estimate 1 – Var(ε) / Var(Y) to test fitness:
– R2 plain = 1 – (∑i≤n (yi – Xi β)2) / (∑i≤n (yi – yavg)2)
– R2 adj. = 1 – (∑i≤n (yi – Xi β)2) / (∑i≤n (yi – yavg)2) · (n – 1) / (n – m – 1)
• Pearson residual: ri = (yi – Xi β) / Var(ε)1/2
– Should be approximately Gaussian with mean 0 and variance 1
– Can be used in another fitness test (more on tests later)

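As a sketch, the same fit statistics in DML; the variable names are assumptions, and the prediction X %*% beta would need the intercept added if one was fitted:

n = nrow (X);  m = ncol (X);
y_pred   = X %*% beta;
ss_res   = sum ((y - y_pred) ^ 2);            # residual sum of squares
ss_tot   = sum ((y - mean (y)) ^ 2);          # total sum of squares around the mean
var_eps  = ss_res / (n - m - 1);              # unbiased residual variance
R2_plain = 1 - ss_res / ss_tot;
R2_adj   = 1 - (ss_res / ss_tot) * (n - 1) / (n - m - 1);
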
LinReg Scripts: Inputs

# INPUT PARAMETERS:
# --------------------------------------------------------------------------------------------
# NAME   TYPE    DEFAULT   MEANING
# --------------------------------------------------------------------------------------------
# X      String  ---       Location (on HDFS) to read the matrix X of feature vectors
# Y      String  ---       Location (on HDFS) to read the 1-column matrix Y of response values
# B      String  ---       Location to store estimated regression parameters (the betas)
# O      String  " "       Location to write the printed statistics; by default is standard output
# Log    String  " "       Location to write per-iteration variables for log/debugging purposes
# icpt   Int     0         Intercept presence, shifting and rescaling the columns of X:
#                          0 = no intercept, no shifting, no rescaling;
#                          1 = add intercept, but neither shift nor rescale X;
#                          2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg    Double  0.000001  Regularization constant (lambda) for L2-regularization; set to nonzero
#                          for highly dependent/sparse/numerous features
# tol    Double  0.000001  Tolerance (epsilon); the conjugate gradient procedure terminates early if
#                          the L2 norm of the beta-residual is less than tolerance * its initial norm
# maxi   Int     0         Maximum number of conjugate gradient iterations, 0 = no maximum
# fmt    String  "text"    Matrix output format for B (the betas) only, usually "text" or "csv"
# --------------------------------------------------------------------------------------------
# OUTPUT: Matrix of regression parameters (the betas); its size depends on the icpt input value:
#          OUTPUT SIZE:    OUTPUT CONTENTS:                 HOW TO PREDICT Y FROM X AND B:
# icpt=0:  ncol(X) x 1     Betas for X only                 Y ~ X %*% B[1:ncol(X), 1], or just X %*% B
# icpt=1:  ncol(X)+1 x 1   Betas for X and intercept        Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
# icpt=2:  ncol(X)+1 x 2   Col.1: betas for X & intercept   Y ~ X %*% B[1:ncol(X), 1] + B[ncol(X)+1, 1]
#                          Col.2: betas for shifted/rescaled X and intercept

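For example, a hedged DML fragment applying the fitted B when icpt = 2, matching the prediction column of the table above (as.scalar extracts the intercept entry):

m = ncol (X);
y_pred = X %*% B[1:m, 1] + as.scalar (B[m + 1, 1]);   # betas for the original X plus the intercept
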
LinReg Scripts: Outputs

# In addition, some regression statistics are provided in CSV format, one comma-separated
# name-value pair per line, as follows:
#
# NAME                 MEANING
# -------------------------------------------------------------------------------------
# AVG_TOT_Y            Average of the response value Y
# STDEV_TOT_Y          Standard Deviation of the response value Y
# AVG_RES_Y            Average of the residual Y - pred(Y|X), i.e. residual bias
# STDEV_RES_Y          Standard Deviation of the residual Y - pred(Y|X)
# DISPERSION           GLM-style dispersion, i.e. residual sum of squares / # deg. fr.
# PLAIN_R2             Plain R^2 of residual with bias included vs. total average
# ADJUSTED_R2          Adjusted R^2 of residual with bias included vs. total average
# PLAIN_R2_NOBIAS      Plain R^2 of residual with bias subtracted vs. total average
# ADJUSTED_R2_NOBIAS   Adjusted R^2 of residual with bias subtracted vs. total average
# PLAIN_R2_VS_0 *      Plain R^2 of residual with bias included vs. zero constant
# ADJUSTED_R2_VS_0 *   Adjusted R^2 of residual with bias included vs. zero constant
# -------------------------------------------------------------------------------------
# * The last two statistics are only printed if there is no intercept (icpt=0)
#
# The Log file, when requested, contains the following per-iteration variables in CSV
# format, each line containing the triple (NAME, ITERATION, VALUE) with ITERATION = 0 for
# initial values:
#
# NAME                 MEANING
# -------------------------------------------------------------------------------------
# CG_RESIDUAL_NORM     L2-norm of the Conj.Grad. residual, which is A %*% beta - t(X) %*% y
#                      where A = t(X) %*% X + diag (lambda), or a similar quantity
# CG_RESIDUAL_RATIO    Ratio of the current L2-norm of the Conj.Grad. residual over the initial
# -------------------------------------------------------------------------------------

Caveats
• Overfitting: the β's reflect individual records in X, not the distribution
– Typically, too few records (small n) or too many features (large m)
– To detect, use cross-validation
– To mitigate, select fewer features; regularization may help too
• Outliers: some records in X are highly abnormal
– They badly violate the distribution, or have very large cell-values
– Check the MIN and MAX of Y, the X-columns, Xi β, and ri2 = (yi – Xi β)2 / Var(ε)
– To mitigate, remove the outliers, or change the distribution or link function
• Interpolation vs. extrapolation
– A model trained on one kind of data may not carry over to another kind of data; the past may not predict the future
– Great research topic!

Generalized Linear Models
• Linear Regression: Y = Xβ* + ε
– Each yi is Normal(μi, σ2) where the mean μi = Xi β*
– Variance(yi) = σ2 = constant
• Logistic Regression:
– Each yi is Bernoulli(μi) where the mean μi = 1 / (1 + exp(– Xi β*))
– Prob[yi = 1] = μi, Prob[yi = 0] = 1 – μi, mean = probability of 1
– Variance(yi) = μi (1 – μi)
• Poisson Regression:
– Each yi is Poisson(μi) where the mean μi = exp(Xi β*)
– Prob[yi = k] = (μi)k exp(– μi) / k! for k = 0, 1, 2, …
– Variance(yi) = μi
• Only in Linear Regression do we add an error εi to the mean μi

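To illustrate the three mean functions side by side, a small DML sketch; X and beta are assumed to be defined, and each line corresponds to one of the bullets above:

eta         = X %*% beta;                 # linear term X beta
mu_linear   = eta;                        # linear regression: identity mean
mu_logistic = 1 / (1 + exp (- eta));      # logistic regression: sigmoid mean
mu_poisson  = exp (eta);                  # Poisson regression: exponential mean
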
Generalized Linear Models
• GLM Regression:
– Each yi has distribution = exp{(yi·θi – b(θi))/a + c(yi, a)}
– The canonical parameter θi represents the mean: μi = bʹ(θi)
– The link function connects μi and Xi β*: Xi β* = g(μi), μi = g–1(Xi β*)
– Variance(yi) = a·bʺ(θi)
• Example: Linear Regression as a GLM
– C(σ)·exp(– (yi – Xi β)2 / 2σ2) = exp{(yi·θi – b(θi))/a + c(yi, a)}
– θi = μi = Xi β;  b(θi) = (Xi β)2 / 2;  a = σ2 = variance
• Link function = identity;  c(yi, a) = – yi2 / 2σ2 + log C(σ)
• Example: Logistic Regression as a GLM
– (μi)y[i] (1 – μi)1 – y[i] = exp{yi·log(μi) – yi·log(1 – μi) + log(1 – μi)} = exp{(yi·θi – b(θi))/a + c(yi, a)}
– θi = log(μi / (1 – μi)) = Xi β;  b(θi) = – log(1 – μi) = log(1 + exp(θi))
• Link function = log(μ / (1 – μ));  Variance = μ(1 – μ);  a = 1

Generalized Linear Models
• GLM Regression:
– Each yi has distribution = exp{(yi·θi – b(θi))/a + c(yi, a)}
– The canonical parameter θi represents the mean: μi = bʹ(θi)
– The link function connects μi and Xi β*: Xi β* = g(μi), μi = g–1(Xi β*)
– Variance(yi) = a·bʺ(θi)
• Why θi? What is b(θi)?
– θi makes formulas simpler, stands for μi (no big deal)
– b(θi) defines what distribution it is: linear, logistic, Poisson, etc.
– b(θi) connects the mean with the variance: Var(yi) = a·bʺ(θi), μi = bʹ(θi)
• What is the link function?
– You choose it to link μi with your features β1xi1 + β2xi2 + … + βmxim
– Additive effects: μi = Xi β;  Multiplicative effects: μi = exp(Xi β);
  Bayes-law effects: μi = 1 / (1 + exp(– Xi β));  Inverse: μi = 1 / (Xi β)
– Xi β has range (– ∞, +∞), but μi may range in [0, 1], [0, +∞), etc.

GLMs We Support
• We specify a GLM by:
– Mean-to-variance connection
– Link function (mean to feature-sum connection)
• Mean-to-variance for common distributions:
– Var(yi) = a·(μi)0 = σ2:  Linear / Gaussian
– Var(yi) = a·μi (1 – μi):  Logistic / Binomial
– Var(yi) = a·(μi)1:  Poisson
– Var(yi) = a·(μi)2:  Gamma
– Var(yi) = a·(μi)3:  Inverse Gaussian
• We support two types: Power and Binomial
– Var(yi) = a·(μi)α:  Power, for any α
– Var(yi) = a·μi (1 – μi):  Binomial

GLMs We Support
• We specify a GLM by:
– Mean-to-variance connection
– Link function (mean to feature-sum connection)
• Supported link functions:
• Power: Xi β = (μi)s, where s = 0 stands for Xi β = log(μi)
– Examples: identity, inverse, log, square root
• Link functions used in binomial / logistic regression:
– Logit, Probit, Cloglog, Cauchit
– They link the Xi β-range (– ∞, +∞) with the μi-range (0, 1)
– They differ in tail behavior
• Canonical link function:
– Makes Xi β = the canonical parameter θi, i.e. sets μi = bʹ(Xi β)
– Power link Xi β = (μi)1 – α if Var = a·(μi)α;  Logit link for binomial

GLM Script Inputs

# NAME   TYPE    DEFAULT   MEANING
# ---------------------------------------------------------------------------------------------
# X      String  ---       Location to read the matrix X of feature vectors
# Y      String  ---       Location to read response matrix Y with either 1 or 2 columns:
#                          if dfam = 2, Y is 1-column Bernoulli or 2-column Binomial (#pos, #neg)
# B      String  ---       Location to store estimated regression parameters (the betas)
# fmt    String  "text"    The betas matrix output format, such as "text" or "csv"
# O      String  " "       Location to write the printed statistics; by default is standard output
# Log    String  " "       Location to write per-iteration variables for log/debugging purposes
# dfam   Int     1         Distribution family code: 1 = Power, 2 = Binomial
# vpow   Double  0.0       Power for Variance defined as (mean)^power (ignored if dfam != 1):
#                          0.0 = Gaussian, 1.0 = Poisson, 2.0 = Gamma, 3.0 = Inverse Gaussian
# link   Int     0         Link function code: 0 = canonical (depends on distribution),
#                          1 = Power, 2 = Logit, 3 = Probit, 4 = Cloglog, 5 = Cauchit
# lpow   Double  1.0       Power for Link function defined as (mean)^power (ignored if link != 1):
#                          -2.0 = 1/mu^2, -1.0 = reciprocal, 0.0 = log, 0.5 = sqrt, 1.0 = identity
# yneg   Double  0.0       Response value for Bernoulli "No" label, usually 0.0 or -1.0
# icpt   Int     0         Intercept presence, X columns shifting and rescaling:
#                          0 = no intercept, no shifting, no rescaling;
#                          1 = add intercept, but neither shift nor rescale X;
#                          2 = add intercept, shift & rescale X columns to mean = 0, variance = 1
# reg    Double  0.0       Regularization parameter (lambda) for L2 regularization
# tol    Double  0.000001  Tolerance (epsilon)
# disp   Double  0.0       (Over-)dispersion value, or 0.0 to estimate it from data
# moi    Int     200       Maximum number of outer (Newton / Fisher Scoring) iterations
# mii    Int     0         Maximum number of inner (Conjugate Gradient) iterations, 0 = no maximum
# ---------------------------------------------------------------------------------------------
# OUTPUT: Matrix beta, whose size depends on icpt:
# icpt=0: ncol(X) x 1;  icpt=1: (ncol(X) + 1) x 1;  icpt=2: (ncol(X) + 1) x 2

GLM Script Outputs

# In addition, some GLM statistics are provided in CSV format, one comma-separated name-value
# pair per line, as follows:
# -------------------------------------------------------------------------------------------
# TERMINATION_CODE     A positive integer indicating success/failure as follows:
#                      1 = Converged successfully; 2 = Maximum number of iterations reached;
#                      3 = Input (X, Y) out of range; 4 = Distribution/link is not supported
# BETA_MIN             Smallest beta value (regression coefficient), excluding the intercept
# BETA_MIN_INDEX       Column index for the smallest beta value
# BETA_MAX             Largest beta value (regression coefficient), excluding the intercept
# BETA_MAX_INDEX       Column index for the largest beta value
# INTERCEPT            Intercept value, or NaN if there is no intercept (if icpt=0)
# DISPERSION           Dispersion used to scale deviance, provided as the "disp" input parameter
#                      or estimated (same as DISPERSION_EST) if the "disp" parameter is <= 0
# DISPERSION_EST       Dispersion estimated from the dataset
# DEVIANCE_UNSCALED    Deviance from the saturated model, assuming dispersion == 1.0
# DEVIANCE_SCALED      Deviance from the saturated model, scaled by the DISPERSION value
# -------------------------------------------------------------------------------------------
#
# The Log file, when requested, contains the following per-iteration variables in CSV format,
# each line containing the triple (NAME, ITERATION, VALUE) with ITERATION = 0 for initial values:
# -------------------------------------------------------------------------------------------
# NUM_CG_ITERS         Number of inner (Conj.Gradient) iterations in this outer iteration
# IS_TRUST_REACHED     1 = trust region boundary was reached, 0 = otherwise
# POINT_STEP_NORM      L2-norm of the iteration step from the old point (i.e. "beta") to the new point
# OBJECTIVE            The loss function we minimize (i.e. negative partial log-likelihood)
# OBJ_DROP_REAL        Reduction in the objective during this iteration, actual value
# OBJ_DROP_PRED        Reduction in the objective predicted by a quadratic approximation
# OBJ_DROP_RATIO       Actual-to-predicted reduction ratio, used to update the trust region
# GRADIENT_NORM        L2-norm of the loss function gradient (NOTE: sometimes omitted)
# LINEAR_TERM_MIN      The minimum value of X %*% beta, used to check for overflows
# LINEAR_TERM_MAX      The maximum value of X %*% beta, used to check for overflows
# IS_POINT_UPDATED     1 = new point accepted; 0 = new point rejected, old point restored
# TRUST_DELTA          Updated trust region size, the "delta"
# -------------------------------------------------------------------------------------------

GLM Likelihood Maximization
• One record: ℓ(yi | θi, a) = exp{(yi·θi – b(θi))/a + c(yi, a)}
• Log ℓ(Y | Θ, a) = 1/a · ∑i≤n (yi·θi – b(θi)) + const(Θ)
• f(β; X, Y) = – ∑i≤n (yi·θi – b(θi)) + λ/2 · βT β → min
– Here θi is a function of β: θi = bʹ–1(g–1(Xi β))
– Add regularization with λ/2 to agree with least squares
– If X has an intercept, do NOT regularize its β-value
• Non-quadratic; how to optimize?
– Gradient descent: fastest when far from the optimum
– Newton's method: fastest when close to the optimum
• Trust Region Conjugate Gradient
– Strikes a good balance between the above two

GLM Likelihood Maximization
• f(β; X, Y) = – ∑i≤n (yi·θi – b(θi)) + λ/2 · βT β → min
• Outer iteration: from β to βnew = β + z
– ∆f(z; β) := f(β + z; X, Y) – f(β; X, Y)
• Use "Fisher Scoring" to approximate the Hessian and ∆f(z; β)
– ∆f(z; β) ≈ ½·zT A z + GT z, where:
– A = XT diag(w) X + λI and G = – XT u + λ·β
– The vectors u, w depend on β via the mean-to-variance and link functions
– FI = XT diag(w) X is the "expected" Hessian
• Trust Region: the area ǁzǁ2 ≤ δ where we trust the approximation ∆f(z; β) ≈ ½·zT A z + GT z
– ǁzǁ2 ≤ δ too small → Gradient Descent step (1 inner iteration)
– ǁzǁ2 ≤ δ mid-size → Cut-off Conjugate Gradient step (2 or more)
– ǁzǁ2 ≤ δ too wide → Full Conjugate Gradient step

Trust Region Conj. Gradient
• Code snippet for Logistic Regression:

g = - 0.5 * t(X) %*% y;  f_val = - N * log (0.5);
delta = 0.5 * sqrt (D) / max (sqrt (rowSums (X ^ 2)));
exit_g2 = sum (g ^ 2) * tolerance ^ 2;
while (sum (g ^ 2) > exit_g2 & i < max_i) {
    i = i + 1;
    r = g;  r2 = sum (r ^ 2);  exit_r2 = 0.01 * r2;
    d = - r;  z = zeros_D;  j = 0;  trust_bound_reached = FALSE;
    while (r2 > exit_r2 & (! trust_bound_reached) & j < max_j) {
        j = j + 1;
        Hd = lambda * d + t(X) %*% diag (w) %*% X %*% d;
        c = r2 / sum (d * Hd);
        [c, trust_bound_reached] = ensure_quadratic (c, sum (d ^ 2), 2 * sum (z * d), sum (z ^ 2) - delta ^ 2);
        z = z + c * d;
        r = r + c * Hd;
        r2_new = sum (r ^ 2);
        d = - r + (r2_new / r2) * d;
        r2 = r2_new;
    }
    p = 1.0 / (1.0 + exp (- y * (X %*% (beta + z))));
    f_chg = - sum (log (p)) + 0.5 * lambda * sum ((beta + z) ^ 2) - f_val;
    delta = update_trust_region (delta, sqrt (sum (z ^ 2)), f_chg, sum (z * g), 0.5 * sum (z * (r + g)));
    if (f_chg < 0) {
        beta = beta + z;
        f_val = f_val + f_chg;
        w = p * (1 - p);
        g = - t(X) %*% ((1 - p) * y) + lambda * beta;
    }
}

ensure_quadratic = function (double x, a, b, c) return (double x_new, boolean test) {
    test = (a * x ^ 2 + b * x + c > 0);
    if (test) {
        rad = sqrt (b ^ 2 - 4 * a * c);
        if (b >= 0) { x_new = - (2 * c) / (b + rad); }
        else        { x_new = - (b - rad) / (2 * a); }
    } else {
        x_new = x;
    }
}

Trust Region Conj. Gradient
• Trust region update in the Logistic Regression snippet:

update_trust_region =
    function (double delta, double z_distance, double f_chg_exact,
              double f_chg_linear_approx, double f_chg_quadratic_approx)
    return (double delta)
{
    sigma1 = 0.25;  sigma2 = 0.5;  sigma3 = 4.0;
    if (f_chg_exact <= f_chg_linear_approx) {
        alpha = sigma3;
    } else {
        alpha = max (sigma1, - 0.5 * f_chg_linear_approx / (f_chg_exact - f_chg_linear_approx));
    }
    rho = f_chg_exact / f_chg_quadratic_approx;
    if (rho < 0.0001) {
        delta = min (max (alpha, sigma1) * z_distance, sigma2 * delta);
    } else { if (rho < 0.25) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma2 * delta));
    } else { if (rho < 0.75) {
        delta = max (sigma1 * delta, min (alpha * z_distance, sigma3 * delta));
    } else {
        delta = max (delta, min (alpha * z_distance, sigma3 * delta));
    }}}
}

GLM: Other Statistics
• REMINDER:
– Each yi has distribution = exp{(yi·θi – b(θi))/a + c(yi, a)}
– Variance(yi) = a·bʺ(θi) = a·V(μi)
• Variance of Y given X
– Estimating the β gives V(μi) = V(g–1(Xi β))
– The constant "a" is called the dispersion, the analogue of σ2
– Estimator: a ≈ 1/(n – m) · ∑i≤n (yi – μi)2 / V(μi)
• Variance of the parameters β
– We use the MLE, hence the Cramér-Rao formula applies (for large n)
– Fisher Information: FI = (1/a)·XT diag(w) X, wi = (V(μi)·gʹ(μi)2)–1
– Estimator: Cov β ≈ a·(XT diag(w) X)–1, Var βj = (Cov β)jj

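A hedged DML sketch of the dispersion estimator above for the power family; mu (the fitted means) and vpow are assumed to be available from the fit:

var_mu = mu ^ vpow;                                              # V(mu) for the power family
a_hat  = sum ((y - mu) ^ 2 / var_mu) / (nrow (X) - ncol (X));    # a ~ 1/(n-m) * sum (y-mu)^2 / V(mu)
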
GLM: Deviance
• Let X have m features, of which k may have no effect on Y
– Will "no effect" result in βj ≈ 0? (Unlikely.)
– Estimate βj and Var βj, then test βj / (Var βj)1/2 against N(0, 1)?
• Student's t-test is better
• Likelihood Ratio Test:
  D = 2·log [ maxβ LGLM(Y | X, a; β1, …, βk, βk+1, …, βm) / maxβ LGLM(Y | X, a; 0, …, 0, βk+1, …, βm) ] > 0
• Null Hypothesis: Y given X follows the GLM with β1 = … = βk = 0
– If NH is true, D is asymptotically distributed as χ2 with k degrees of freedom
– If NH is false, D → +∞ as n → +∞
• P-value % = Prob[χ2k > D] · 100%

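Assuming the likelihood-ratio statistic D and the degrees of freedom k are available as scalars, the p-value can be sketched in DML with the chi-squared cdf builtin (the builtin's availability and parameter names are an assumption here):

p_value = 1.0 - cdf (target = D, dist = "chisq", df = k);   # Prob[ chi2_k > D ]
print ("P-value % = " + (100 * p_value));
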
GLM: Deviance
• To test many nested models (feature subsets) we need their maximum likelihoods to compute D
– PROBLEM: the term "c(yi, a)" in the GLM's exp{(yi·θi – b(θi))/a + c(yi, a)}
• Instead, compute the deviance:
  D = 2·log [ maxΘ LGLM(Y | Θ; a : saturated model) / maxβ LGLM(Y | X, a; β1, …, βk, βk+1, …, βm) ] > 0
• The "saturated model" has no X, no β, but picks the best θi for each individual yi (not realistic at all, just a convention)
– The term "c(yi, a)" is the same in both models!
– But "a" has to be fixed, e.g. to 1
• The deviance itself is used for goodness-of-fit tests, too

Survival Analysis
Given
• Survival data from individuals as (time, event)
• Categorical/continuous features for each individual
Estimate
• Probability of survival to a given time
• Rate of hazard at a given time
[Figure: timelines of several patients over time, with † = death from the specific cancer and ? = lost to follow-up (censored)]

Cox Regression
• Semi-parametric model, "robust"
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
[Figure: proportional-hazards formula annotated with "baseline hazard", "covariates", and "coefficients"]

Event Hazard Rate
• Symptom events E follow a Poisson process
[Figure: timeline with events E1, E2, E3, E4 ending in Death, annotated with the hazard function]
• Hazard function = Poisson rate:
  h(t; state) = lim Δt→0 Prob[ E ∈ [t, t + Δt) | state ] / Δt
• Given the state and the hazard, we can compute the probability of the observed event count:
  Prob[ K events in t1 ≤ t ≤ t2 ] = H^K · e^(–H) / K!,  where  H = ∫ t1..t2 h(t; state(t)) dt

Cox Proportional Hazards
• Assume that exactly 1 patient gets event E at time t
• The probability that it is Patient #i is the hazard ratio:
  Prob[ #i gets E ] = h(t; si) / ∑ j=1..n h(t; sj),  where  si = statei
• Cox assumption: h(t; state) = h0(t) · Λ(state) = h0(t) · exp(λT s)
• The time confounder h0(t) cancels out!
[Figure: Patients #1 … #n with states s1 … sn at time t]

Cox "Partial" Likelihood
• Cox "partial" likelihood for the dataset is a product over all events E:
  LCox(λ) = Prob[ all E ] = ∏ t: E  h(t; swho(t)(t)) / ∑ j=1..n h(t; sj(t))
                          = ∏ t: E  exp(λT swho(t)(t)) / ∑ j=1..n exp(λT sj(t))
[Figure: Patients #1 … #n observed over time]

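A simplified DML sketch of the negative partial log-likelihood under the assumptions spelled out in the comments; this is only an illustration, not the Cox script's implementation:

# Assumed inputs: covariate matrix S (n x m), event indicator E (n x 1, 1 = event, 0 = censored),
# rows already sorted by decreasing survival time with no tied event times; lambda = coefficients.
eta      = S %*% lambda;                       # linear predictor lambda^T s for each individual
risk     = exp (eta);
cum_risk = cumsum (risk);                      # risk-set sums: all rows with time >= current time
neg_partial_ll = - sum (E * (eta - log (cum_risk)));
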
Cox Regression
• Semi-parametric model, "robust"
• Most commonly used
• Handles categorical and continuous data
• Handles (right/left/interval) censored data
Cox regression in DML
• Fitting parameters via the negative partial log-likelihood
• Method: trust-region Newton with conjugate gradient
• Inverting the Hessian using block Cholesky for computing the std. error of the betas
• Similar features as coxph() in R, e.g., stratification, frequency weights, offsets, goodness-of-fit testing, recurrent event analysis
[Figure: proportional-hazards formula annotated with "baseline hazard", "covariates", and "coefficients"]

BACK-UP

Kaplan-Meier Estimator
[Figure]

Kaplan-Meier Estimator
[Figure]

Confidence Intervals
• Definition of Confidence Interval; p-value
• Likelihood ratio test
• How to use it for confidence intervals
• Degrees of freedom
