2. Outline
What is survival analysis?
Censored and truncated data
Life table
Kaplan-Meier estimator
Log-rank test
Cox regression model
3. Survival Analysis
We estimate and compare means and proportions by
confidence intervals and hypothesis testing
We also make predictions by regression models
But these methods cannot usually be used for ‘survival’ data
This is because survival data differ from the types of data we
have studied so far in two important aspects:
Some observations do not experience the event of interest
when the study period is completed (censored)
Survival times are hardly ever normally distributed (skewed)
4. What is survival analysis?
It is a branch of statistics that focuses on time-to-event
data and their analysis.
Survival data deal with time until occurrence of any well-
defined event.
The outcome variable examined is the survival time (the
time until the occurrence of the event).
It is special because it can incorporate information
about censored data into the analysis.
5. Objectives of survival analysis
Estimate the probability that an individual survives beyond a
given time.
E.g. the probability that a group of MI patients goes longer than
two months without a second heart attack.
Compare time-to-event between two or more groups.
E.g. Treatment vs placebo patients for a randomized
controlled trial.
Assess the relationship of covariates to time-to-event.
E.g. Do weight, BP, blood sugar and height influence the survival
time for a group of patients?
6. Survival analysis
In order to define a failure time random variable, we need:
An unambiguous time origin. (e.g. date of randomization
to clinical trial, time of exposure etc.)
A time scale (e.g. days, years)
A definition of the event (e.g. death)
7. Survival analysis
You can use survival analysis when you wish to analyze survival
times or “time-to-event” intervals like:
From diagnosis to death
Time until response to a treatment
From exposure to development of symptom of disease
From treatment of infertility to conception
From the start of treatment to its failure
Time until resumption of smoking by someone who had quit
Time until certain percentage of weight loss
The statistical treatment of survival times (survival data) is
known as survival analysis.
8. Truncation and Censoring
Truncation is about entering the study
Right: Event has already occurred (e.g. cancer registry)
Left: “staggered entry”
Whereas, censoring is about leaving the study. This is because
survival data can be one of two types:
Complete data – the value of each sample unit is observed or
known.
Censored data – the time to the event of interest may not be
observed or the exact time is not known.
9. Truncation and Censoring cont…
Censored data can occur when:
The event of interest is death, but the patient is still alive at
the time of analysis.
The individual was lost to follow-up without having the event
of interest.
The event of interest is death by cancer but the patient died
of an unrelated cause, such as a car accident.
The patient is dropped from the study without having
experienced the event of interest due to a protocol violation.
Even if an observation is censored we will still include it in our
analysis.
10. Type of Censoring
The most common type of censoring occurs when the event in
question has not yet occurred as of the time of last observation.
This type of censoring is called right censoring.
A follow-up time is left censored if we know that the event of
interest took place at an unknown time before the observed
time.
Example: In a study modeling the age at which “regular”
smoking starts, a 12-year-old subject may report that he is a
regular smoker but does not remember when he started
smoking regularly.
11. Type of Censoring cont..
[Figure: follow-up timelines for subjects A, B, C and D across the
recruitment interval and the additional follow-up interval]
In clinical and some public health studies, participants are typically recruited
over a recruitment interval (this is called staggered entry) and then followed for
an additional period of time.
12. Types of Censoring
A is left censored.
B is fully observed.
C is right censored because the observation is lost to study.
This type of right censoring does not cause any problems if the
censoring is random.
D is right censored because the observation period ends
before the event has occurred. This type of censoring does not
cause any problems for the analysis.
13. Assumptions
Any standard method of survival analysis deals with right
censoring and left truncation, and has the following assumptions:
Those at risk at time t are a random sample from the population of
interest at risk at time t. (This is called the non-informative or
independent censoring assumption.)
That means that, among those with the same values of X (group),
censored subjects must be at a similar risk of subsequent events as
subjects with continued follow-up.
These methods are inappropriate if the censoring mechanism is in any
way related to the probability of the event of interest. We call such
mechanisms ‘informative censoring’.
14. Life table
The set of probabilities used in estimating the probability of the
occurrence of an event or survival at each year and the cumulative
probability of survival to each year is called a life table.
To carry out the calculation, we first set out for each year (X) :
the number alive at the start = nx
the number withdrawn during the year=wx
the number at risk = rx
the number dying = dx
15. Life table calculation for parathyroid cancer survival:
the survival times are given in years after diagnosis
Year  Number    Withdrawn    At risk  Deaths  Prob. of  Prob. of        Cumulative prob. of
      at start  during year                   death     surviving year  surviving x years
(x)   (nx)      (wx)         (rx)     (dx)    (qx)      x (px)          (Px)
1     20        2            19       1       0.0526    0.9474          0.9474
2     17        2            16       0       0         1               0.9474
3     15        0            15       1       0.0667    0.9333          0.8842
4     14        0            14       0       0         1               0.8842
5     14        1            13.5     0       0         1               0.8842
6     13        1            12.5     0       0         1               0.8842
7     12        1            11.5     2       0.1739    0.8261          0.7304
8     9         0            9        1       0.1111    0.8889          0.6493
9     8         1            7.5      0       0         1               0.6493
10    7         0            7        2       0.2857    0.7143          0.4638
11    5         2            4        0       0         1               0.4638
12    3         0            3        1       0.3333    0.6667          0.3092
13    2         0            2        0       0         1               0.3092
14    2         0            2        0       0         1               0.3092
15    2         0            2        1       0.5000    0.5000          0.1546
16    1         0            1        0       0         1               0.1546
17    1         0            1        0       0         1               0.1546
18    1         1            0.5      0       0         1               0.1546
rx = nx − ½wx,  qx = dx/rx,  px = 1 − qx,  Px = px × P(x−1), with P0 = 1
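The life table calculation can be reproduced from the yearly counts of withdrawals and deaths. The following is a minimal Python sketch, assuming the counts shown in the parathyroid cancer table; the function name `life_table` and the dictionary keys are illustrative choices, not part of the original slides.

```python
def life_table(n_start, withdrawn, deaths):
    """Actuarial life table: one dict per yearly interval."""
    rows = []
    n = n_start          # number alive at the start of the interval (nx)
    cumulative = 1.0     # P0 = 1
    for year, (w, d) in enumerate(zip(withdrawn, deaths), start=1):
        r = n - w / 2                 # at risk: a withdrawal counts for half the year
        q = d / r if r > 0 else 0.0   # probability of death in the year
        p = 1 - q                     # probability of surviving the year
        cumulative *= p               # Px = px * P(x-1)
        rows.append({"year": year, "n": n, "w": w, "r": r,
                     "d": d, "q": q, "p": p, "P": cumulative})
        n = n - w - d                 # number alive at the start of the next year
    return rows

# Counts taken from the parathyroid cancer table (years 1 to 18)
withdrawn = [2, 2, 0, 0, 1, 1, 1, 0, 1, 0, 2, 0, 0, 0, 0, 0, 0, 1]
deaths    = [1, 0, 1, 0, 0, 0, 2, 1, 0, 2, 0, 1, 0, 0, 1, 0, 0, 0]

table = life_table(20, withdrawn, deaths)
for row in table:
    print(f"{row['year']:>2}  n={row['n']:>2}  r={row['r']:>4}  "
          f"q={row['q']:.4f}  P={row['P']:.4f}")
```

Each printed row should agree with the corresponding row of the table to four decimal places.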
16. Function describing survival times
Let T be a random variable that represents survival time
The distribution of survival time can be described by
the survival function, S(t)
S(t) = P(T > t)
S(t) is the probability that a subject selected randomly
survives longer than time t
Properties
S(t=0) = 1
S(t) is bounded by 0 and 1, since it is a probability
S(t) is a non-increasing function
17. Function describing survival times cont’d
The median survival time (call it τ) is the time by which 50%
of the subjects have experienced the event.
That means the median survival time is the time where S(τ) = 0.5.
In practice, however, the estimated survival curve is a step function
and usually does not equal 0.5 exactly at any failure time. In this
case, the estimated median survival is the smallest time τ such
that S(τ) ≤ 0.5.
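Reading the median off a stepwise survival estimate can be sketched in a few lines of Python. The example below assumes the cumulative survival probabilities (Px) from the parathyroid cancer life table; the function name `median_survival` is illustrative.

```python
def median_survival(times, survival):
    """Smallest time at which the estimated survival is at or below 0.5."""
    for t, s in zip(times, survival):
        if s <= 0.5:
            return t
    return None  # the curve never drops to 0.5, so the median is not reached

# Cumulative survival probabilities (Px) for years 1 to 18 of the life table
years = list(range(1, 19))
P = [0.9474, 0.9474, 0.8842, 0.8842, 0.8842, 0.8842, 0.7304, 0.6493, 0.6493,
     0.4638, 0.4638, 0.3092, 0.3092, 0.3092, 0.1546, 0.1546, 0.1546, 0.1546]

print(median_survival(years, P))  # 10
```

The estimate never equals 0.5 exactly here; it falls from 0.6493 to 0.4638 in year 10, so year 10 is taken as the median.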
18. Survival function
If there is no censoring, then a good estimator of S(t) at
time t is:
S(t) = (number of patients surviving longer than time t) /
(total number of patients on trial)
which is simply the proportion surviving beyond t.
But usually there is censoring. Therefore, we can best
estimate S(t) using the Kaplan-Meier estimator or the life
table.
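For fully observed data, the proportion estimator is a one-liner. The sketch below uses hypothetical survival times that are not from the slides; the function name `empirical_survival` is also an illustrative choice.

```python
def empirical_survival(times, t):
    """Proportion of subjects whose survival time exceeds t (no censoring)."""
    return sum(1 for x in times if x > t) / len(times)

# Hypothetical, fully observed survival times in months (illustration only)
times = [2, 3, 3, 5, 8, 9, 12, 15, 20, 24]

print(empirical_survival(times, 0))   # 1.0
print(empirical_survival(times, 8))   # 0.5
```

This estimator breaks down as soon as some times are censored, because a censored subject can be counted neither as surviving beyond t nor as having failed before t.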
20. Kaplan-Meier (KM) estimator
KM estimator helps us to find S(t) when there are censored
data.
To find the KM estimator, we break the survival probability up
into a sequence of conditional probabilities.
The probability of surviving t (t > 2) or more years from the
beginning of the study is the product of the observed yearly
survival rates, i.e. S(t) = p1 × p2 × p3 × … × pt, where
pi = (ni − di)/ni.
Note that if all the data are uncensored, the numerator of pi
cancels with the denominator of pi+1 to give (nt − dt)/n0, where n0
is the number of patients at the start. This is simply the
proportion surviving (see the previous and the next slide).
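The cancellation can be written out explicitly. This is a short derivation, assuming no censoring so that everyone who survives interval i is at risk in interval i+1, and writing n_1 = n_0 for the number at risk in the first interval:

```latex
S(t) = \prod_{i=1}^{t} p_i
     = \frac{n_1 - d_1}{n_1}\cdot\frac{n_2 - d_2}{n_2}\cdots\frac{n_t - d_t}{n_t}
     = \frac{n_t - d_t}{n_1}
     = \frac{n_t - d_t}{n_0},
\qquad \text{since } n_{i+1} = n_i - d_i \text{ and } n_1 = n_0 .
```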
21. Kaplan-Meier estimator
Mathematically we can put the KM estimator as:
Ŝ(t) = ∏ pj = ∏ (nj − dj)/nj, over all j with tj ≤ t
pj = probability of surviving past tj, estimated by the proportion
of people living through tj out of those who have survived beyond tj-1
nj = Number at risk at time tj
dj = Number who died at time tj
nj – dj = Number who survived beyond tj
By convention, unlike the life table, if any subjects are censored at
time tj, then they are considered to have survived for longer than tj,
and adjustments of the form (nj − ½wj) are not applied.
22. How to calculate the KM estimator
(Parathyroid cancer data)
Example 1: There are nine events; ‘+’ marks a censored time.
1+, 1+, 1, 2+, 2+, 3, 5+, 6+, 7+, 7, 7, 8, 9+, 10, 10, 11+, 11+, 12, 15, 18+
Recall that Ŝ(tj) = Ŝ(tj-1) × pj, with pj = (nj − dj)/nj:
Ŝ(1) = Ŝ(0)p1 = (1)(19/20) = 0.9500
Ŝ(3) = Ŝ(1)p3 = (0.9500)(14/15) = 0.8867
Ŝ(7) = Ŝ(3)p7 = (0.8867)(10/12) = 0.7389
Ŝ(8) = Ŝ(7)p8 = (0.7389)(8/9) = 0.6568
Ŝ(10) = Ŝ(8)p10 = (0.6568)(5/7) = 0.4691
Ŝ(12) = Ŝ(10)p12 = (0.4691)(2/3) = 0.3128
Ŝ(15) = Ŝ(12)p15 = (0.3128)(1/2) = 0.1564
Note that the survival probabilities from KM differ from those
in the life table. This is due to the difference in conventions for
treating censored data when calculating the number at risk.
In each Ŝ(t), the product is taken over all times at which a death
occurred, up to and including t.
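The hand calculation can be checked with a short program. The following is a minimal pure-Python sketch of the Kaplan-Meier product, assuming the parathyroid cancer data listed on this slide with status 1 for a death and 0 for a censored time; the function name `kaplan_meier` is illustrative and no statistics library is used.

```python
from collections import Counter

def kaplan_meier(times, events):
    """Return [(t_j, n_j, d_j, S(t_j))] at each distinct death time.

    Subjects censored at t_j are still counted as at risk at t_j.
    """
    deaths = Counter(t for t, e in zip(times, events) if e == 1)
    estimate, s = [], 1.0
    for t in sorted(deaths):
        n_at_risk = sum(1 for x in times if x >= t)   # n_j
        d = deaths[t]                                 # d_j
        s *= (n_at_risk - d) / n_at_risk              # multiply by p_j
        estimate.append((t, n_at_risk, d, s))
    return estimate

# Parathyroid cancer data: 1 = death, 0 = censored ('+' on the slide)
times  = [1, 1, 1, 2, 2, 3, 5, 6, 7, 7, 7, 8, 9, 10, 10, 11, 11, 12, 15, 18]
events = [0, 0, 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 0, 1, 1, 0, 0, 1, 1, 0]

km = kaplan_meier(times, events)
for t, n, d, s in km:
    print(f"t={t:>2}  n={n:>2}  d={d}  S={s:.4f}")
```

The numbers at risk (20, 15, 12, 9, 7, 3, 2) are whole numbers here, whereas the life table uses half-withdrawal adjustments such as 11.5 in year 7. That is the source of the small differences between the two sets of estimates.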
24. Kaplan Meier curve cont…
Example 2: Motion sickness data. In an experiment testing two
drugs to delay vomiting on 49 passengers, 21 passengers
received the first drug, of whom five had definite events
(vomiting) at 30, 50, 51, 82, and 92 minutes.
The second drug was given to 28 passengers, of whom 14 had
events: one each at 5, 13, 24, 63, 65, 79, 102, and 115 minutes,
and two each at 11, 69, and 82 minutes.
26. Kaplan Meier curve
Variable definition for motion sickness data (study):
Time is the time in minutes from the point of randomization to
either vomiting or censoring
Status has a value of 1 if a passenger vomited and a value of 0 if
censored. That is, the status will be 0 if a passenger had not
vomited by the end of the study
Drug specifies a value of 1 or 2 that corresponds to treatment
1 and treatment 2 respectively
27. How to use SPSS
Analyze > Survival > Kaplan Meier
Time: Time
Status: status(1)
Define the event value as 1, since 1 is the value indicating
that the event (i.e. vomiting) has occurred
Options: Check the survival plot
28. Kaplan Meier curve …
If the last observation is uncensored, the K-M estimate at
that time equals zero.
Each time there is a censoring, the number at risk (the denominator)
decreases, but the K-M curve stays level; it drops only at event
times. Censorings are marked on the curve by hash marks.
29. Limitations of Kaplan-Meier
Mainly descriptive
Doesn’t control for covariates
Requires categorical predictors
Can’t accommodate time-dependent variables
31. Cox regression Models
Cox regression is a regression method introduced by Sir David Cox
in 1972.
It is a model that relates the time that passes before some
event occurs to one or more covariates that may be
associated with that amount of time.
It is also known as proportional hazards regression analysis.
32. Cox regression model
This model produces a survival function that predicts the
probability that the event has not yet occurred by a given time t,
for given values of the predictor variables (covariates).
Unlike linear regression, survival analysis has a dichotomous
(binary) outcome.
Unlike logistic regression, it analyzes the time to an event, and
has the following added advantages:
Able to account for censoring
Can compare survival between two or more groups
Assess relationship between covariates and survival time
33. Cox Regression model…
It is a semi-parametric model
Cox regression models the effect of predictors and covariates
on the hazard rate but leaves the baseline hazard rate
unspecified.
It does NOT require knowledge of the absolute risk;
it estimates relative rather than absolute risk.
34. Cox regression model
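h(t | Xi) = ho(t) × exp(βXi)
t is the time
Xi is the covariate for the ith individual
β is the regression coefficient; there is no separate intercept βo,
because any constant is absorbed into ho(t)
ho(t) is the baseline hazard function,
i.e. ho(t) is the hazard when all the covariates are equal to
zero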
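The "proportional hazards" property of this model can be illustrated numerically. The sketch below assumes the usual form with no separate intercept (any constant is absorbed into ho(t)); the baseline hazard function and the value of β are hypothetical, chosen only for illustration.

```python
import math

def baseline_hazard(t):
    """Hypothetical baseline hazard ho(t); any positive function would do."""
    return 0.01 * t

def hazard(t, x, beta):
    """Cox model hazard: h(t | x) = ho(t) * exp(beta * x)."""
    return baseline_hazard(t) * math.exp(beta * x)

beta = 0.5  # hypothetical coefficient

for t in (1, 5, 10):
    ratio = hazard(t, 1, beta) / hazard(t, 0, beta)
    # ho(t) cancels, so the ratio is exp(beta) at every time t
    print(t, round(ratio, 4))
```

Because ho(t) cancels in the ratio, the model can estimate β without ever specifying the baseline hazard, which is why it is called semi-parametric.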
36. Interpretation of the betas
First we need to find the hazard ratio for a one-unit increase
in the covariate, with the other covariates held fixed:
h(t, x1 + 1) / h(t, x1) = [ho(t) × exp(β(x1 + 1))] / [ho(t) × exp(βx1)] = exp(β)
β is the increase in the log hazard (the log hazard ratio) for a
unit increase in covariate X.
Note that censored cases contribute no event of their own to the
estimation of the regression coefficients, but they remain in the
risk set until they are censored and are used to compute the
baseline hazard.
37. The Cox regression model cont…
The time variable should be quantitative.
The status variable can be categorical or continuous.
The independent variables (covariates) can be continuous or
categorical;
If categorical, they should be dummy or indicator coded (there
is an option in the procedure to recode categorical variables
automatically).
38. The Cox regression model cont…
If there are no covariates or only one categorical covariate, the
Kaplan-Meier or Life Table procedure can be used
If there is no censored data in the sample, the linear
regression procedure can be used to model the relationship
between predictors and time to event.
39. The Cox regression model cont…
SPSS output for motion sickness data
Variables in the Equation
                                                       95.0% CI for Exp(B)
            B       SE     Wald    df   Sig.   Exp(B)  Lower    Upper
treatment   -1.372  .547   6.286   1    .012   .254    .087     .741
Interpretation:
The hazard of vomiting for patients receiving treatment 1 is
25.4% of that for patients on treatment 2.
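The Exp(B), confidence interval and Wald columns all follow from B and SE. A minimal Python sketch, assuming a normal-approximation 95% interval (z = 1.96); the function name `summarize_coefficient` is illustrative and is not an SPSS routine.

```python
import math

def summarize_coefficient(b, se, z=1.96):
    """Hazard ratio, 95% CI and Wald statistic from a coefficient and its SE."""
    return {
        "hazard_ratio": math.exp(b),          # Exp(B)
        "ci_lower": math.exp(b - z * se),     # lower limit for Exp(B)
        "ci_upper": math.exp(b + z * se),     # upper limit for Exp(B)
        "wald": (b / se) ** 2,                # Wald chi-squared, 1 df
    }

# B and SE for treatment, taken from the SPSS output
result = summarize_coefficient(-1.372, 0.547)
print({name: round(value, 3) for name, value in result.items()})
```

The hazard ratio and interval reproduce .254, .087 and .741. The Wald statistic comes out slightly above the reported 6.286, presumably because B and SE are rounded to three decimals in the output.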
40. The Cox regression model cont…
Example 2: Recurrence of gallstones (data can be accessed from SPSS
template or Martin B.)
Variable              Coefficient  Standard error  95% CI for B      Exp(B)  P-value
Multiple gallstones   0.838        0.401           0.046 to 1.631    2.313   0.036
Maximum diameter      -0.023       0.036           -0.094 to 0.049   0.978   0.531
Months to dissolve    0.044        0.017           0.011 to 0.078    1.045   0.008
The chi-squared statistic tests the relationship between the
time to recurrence and the three variables together
A positive b coefficient shows an increased risk of the event, in
this case recurrence.
41. Cox regression cont…
The maximum diameter has no significant relationship to time
to recurrence.
The coefficient for multiple gallstones is 0.838. If we antilog
this, we get exp(0.838) = 2.31.
This means that, at any time, a patient with multiple gallstones
is 2.31 times as likely to have a recurrence as a patient with
a single stone.
The 95% CI for this estimate (relative hazard) is 1.05 to 5.11
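The same antilog step applies to every row of the gallstone table. The sketch below exponentiates each coefficient and its confidence limits and flags whether the interval for the hazard ratio excludes 1; the values are copied from the table, while the names `rows`, `to_hazard_scale` and `summary` are illustrative.

```python
import math

# (coefficient, CI lower, CI upper) on the log-hazard scale, from the table
rows = {
    "Multiple gallstones": (0.838, 0.046, 1.631),
    "Maximum diameter":    (-0.023, -0.094, 0.049),
    "Months to dissolve":  (0.044, 0.011, 0.078),
}

def to_hazard_scale(b, lower, upper):
    """Exponentiate a coefficient and its CI to get the hazard ratio and its CI."""
    return math.exp(b), math.exp(lower), math.exp(upper)

summary = {}
for name, (b, lower, upper) in rows.items():
    hr, lo, hi = to_hazard_scale(b, lower, upper)
    excludes_one = lo > 1 or hi < 1   # a CI excluding 1 matches p < 0.05
    summary[name] = (round(hr, 2), round(lo, 2), round(hi, 2), excludes_one)
    print(name, summary[name])
```

Only the interval for maximum diameter includes 1, which agrees with its non-significant p-value of 0.531.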
42. Cox regression cont…
Example 3
In a cancer drug trial, 37 patients were randomized to the treatment
group and 32 patients to the control group. Their survival times
(until death) are measured in months and some observations are
censored. (variables: group, sex, and age)
Result of Cox regression for the cancer trial example
Explanatory variable   Hazard Ratio   95% CI         P-value
Group
  Control              1.0
  Treatment            0.1052         0.086-0.262    <0.0001
Sex
  Male                 1.0
  Female               0.9127         0.732-1.366    0.4342
Age                    1.127          1.103-1.152    0.002
43. Cox regression cont…
Interpretation
The death hazard in the treatment group is 0.1052 times (95 per
cent CI: 0.086–0.262) that in the control group, a reduction in
risk of almost 90 per cent at any given time (p < 0.0001).
The death hazard for females does not differ significantly from that
for males (the CI includes 1.0 and the p-value is large).
Each 1-year increase in age multiplies the death hazard
by a factor of 1.127 (p = 0.002).
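Hazard ratios convert to percentage changes with simple arithmetic. A small Python sketch using the hazard ratios from the cancer trial table; the function names are illustrative, and the multi-year calculation assumes the effect of age is log-linear, as the Cox model specifies.

```python
def percent_change_in_hazard(hazard_ratio):
    """Percent change in hazard implied by a hazard ratio (negative = reduction)."""
    return (hazard_ratio - 1) * 100

def hazard_ratio_for_increase(hr_per_unit, units):
    """Hazard ratio for an increase of `units`, assuming a log-linear effect."""
    return hr_per_unit ** units

print(round(percent_change_in_hazard(0.1052), 1))   # -89.5: treatment vs control
print(round(percent_change_in_hazard(1.127), 1))    # 12.7: per extra year of age
print(round(hazard_ratio_for_increase(1.127, 10), 2))  # for 10 extra years of age
```

The per-year ratio compounds multiplicatively, so ten extra years of age correspond to a hazard ratio of 1.127 raised to the power 10, not 10 times 12.7 per cent.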
If there are censored observations, then the simple proportion S̃(t) is not a good estimate of the true S(t), so other non-parametric methods must be used to account for censoring (life-table methods, Kaplan-Meier estimator).
There are three major types of survival analysis techniques, differing in the assumptions that need to be made. As a metaphor, a pair of trousers is a parametric model. More exactly it is a 2-parameter model, one on waist (wogeb) circumference and one on leg length. In contrast, a skirt (kemiss) with an elastic waist is a non-parametric model. This is unlikely to fit well but it never fails badly. If you don’t know or if you are unwilling to guess the body size (the data), you can buy a skirt with an elastic waist (a non-parametric model). If you are confident that you know the body size quite well, you can buy a pair of trousers (a parametric model) and get a better fit. A semi-parametric model is a trade-off between the two extremes.
An introductory text covering all three types of techniques, with a medical focus, is:
Collett D. Modelling Survival Data in Medical Research. London: Chapman & Hall, 1994.