Interval-Censored Survival Analysis
Primary Author: JohnKornak
Description
Survival analysis methods play an important role in the statistical analysis of medical data. In particular the advent of the Cox proportional hazards model coupled with the increasing computational power of the standard desktop computer, has led to widespread (and virtually automated) implementation of the Cox model for a wide range of survival data problems.
One notable exception for analyzing survival data for which the Cox model cannot be implemented "off-the-shelf" occurs when data are interval-censored: the usual quick and dirty approach of implementing the Cox model based on mid-points of intervals defining the time-of-event leads to conservative results (and referees are objecting more frequently to the use of mid-points). With interval-censored data, not only are many observations right-censored as in conventional survival data (that is the event [death] has not necessarily occurred by the time the subject is lost to follow up), but also for events that have occurred we do not have precise information as to when the event occurred; we only know that the event occurred within the last two follow up times.
The objectives of this document are to a) describe the tools available to the statistical consultant for analyzing interval-censored data; and b) provide guidance to efficiently approach interval-censored analysis (considering both the stability of implementation, as well as the ease of interpretability when disseminating results to medical researchers).
There are three primary approaches to dealing with interval-censored data: a) parametric modeling (accelerated failure time); b) non-parametric maximum likelihood (NPLME) Kaplan-Meier-Turnbull interval-censored methods; and c) complementary log-log link based ordered logistic regression proportional hazards modeling (this last approach requires that the intervals are somewhat consistent across subjects: e.g., patients might be followed up at pre-defined times post treatment such as 1 month, 3 months and 6 months post-surgery.
We now discuss options available in the two major statistical packages for each of these interval-censored analysis methods and how to implement the analyses:
R
The symbol > is used to define the R prompt, so that text beyond that corresponds to commands given to R.
Non-parametric:
- First, format the data for interval-censored survival analysis. There are two basic formats for interval-censored data in R, the easiest of which is called "interval2", defined as follows: a subject's data for the pair of columns in the dataset (time1, time2) is (t_e, t_e) if the event time t_e is known exactly; (t_l, NA) if right censored (where t_l is the censoring time); and (t_l, t_u) if interval censored (where t_l is the lower and t_u is the upper bound of the interval). An alternative format ("interval") is also available in R, but this format also requires a column for event status (type ?Surv in R after loading the survival package for more details if client provides data in such a format).
- Start R and load survival package > library(survival) - note that the survival package is already included with the standard R install.
- Load the interval package > library(interval) - note that on first use the package will need to be installed via > install.packages("interval"),
- Generate a survival object from the dataset
> YourSurv = Surv(YourData$time1,YourData$time2,type="interval2")
- Generate Kaplan-Meier-Turnbull non-parametric maximum likelihood (NPLME) survival curve estimates by group using e.g., KMTcurves = icfit(YourSurv ~ group, data = YourData). If there is no important grouping in the data a single curve can be generated with icfit(YourSurv ~ 1, data = YourData). Plot the curves via > plot(KMTcurves).
- A test for group differences can be performed via > ictest(YourSurv ~ group, data = YourData)
- Note that there is an additional package in R called intcox that performs proportional hazards analysis for interval-censored data using the iterant convex minorant algorithm. At the time of writing however, the package is not fully developed and does not provide confidence intervals and p-values for estimated hazard ratios. Consequently we do not suggest the use of intcox as a routine package for use in statistical consultation.
Parametric:
- First, format the data for interval-censored survival analysis. There are two basic formats for interval-censored data in R, the easiest of which is called "interval2", defined as follows: a subject's data for the pair of columns in the dataset (time1, time2) is (t_e, t_e) if the event time t_e is known exactly; (t_l, NA) if right censored (where t_l is the censoring time); and (t_l, t_u) if interval censored (where t_l is the lower and t_u is the upper bound of the interval). An alternative format ("interval") is also available in R, but this format also requires a column for event status (type ?Surv in R after loading the survival package for more details if client provides data in such a format). Note that it is critical for the implementation of parametric survival regression for a range of survival distributions (including the Weibull) that there are no zeros in the survival time data, even if the zero data points only correspond to the lower value of a censored interval. If there are zeros in the data R returns the error: "Error in survreg: Invalid survival times for this distribution". A trick to get around this problem when you have intervals that begin at time zero is to add a small constant to all survival times in the dataset.
- Load survival package > library(survival) - note that the survival package is included with the standard R install.
- Fit a parametric accelerated failure time model with the R function survreg. To fit a survival model with a Weibull distributed survival data use > modelfit = survreg(YourSurv ~ group + covariates, dist = "weibull", data = YourData) in the presence of additional covariates or for unadjusted model fits > modelUnadj = survreg(YourSurv ~ group, dist = "weibull", data = YourData). Note that the Weibull is the default distribution so in these examples the function would still work if the dist argument were omitted. However, for all other survival distributions the dist argument must be included.
- (For Weibull accelerated failure time model only) Generate hazard ratio estimates and confidence intervals from the parameters of the fitted Weibull accelerated failure time model: take the exponential of the model coefficient estimates and confidence intervals and then divide these by the fitted scale parameter (the scale parameter can be extracted via > modelfit$scale).
- Generate survival curves (unadjusted or adjusted) using the predict command to generate the curve points, e.g., for unadjusted:
predCurve = predict(modelUnadj,newdata=data.frame(group="group1"),type='quantile',p=pct) and then plot predCurve versus 1 - pct. For adjusted curves the values that the other covariates should be fixed at must also be given to the predict function, e.g., predict(modelfit, newdata=list(group = "group1", covariate=0), type='quantile',p=pct) will generate the curve based on the covariate taking the value 0.
Ordered logistic regression:
- Load the MASS package > library(MASS) - note that MASS package is included with the standard R install.
- Use the polr function with method = "cloglog". When the cloglog link is used, the ensuing ordered logistic model is of the proportional hazards form for survival times where the survival times can be considered as grouped within intervals (with the intervals being those defined by the discrete interval-censoring mechanism).
SAS
Non-parametric:
- Use the macros available for SAS for the purpose of fitting Kaplan-Meier-Turnbull non-parametric maximum likelihood (NPLME) survival curves and testing for group differences. The following reference gives details on the macros and their implementation: So, Y., Johnston, G., and Kim, S.H. (2010), "Analyzing Interval-Censored Survival Data with SAS\xAE Software," Proceedings of the SAS\xAE Global Forum 2010 Conference, Cary, NC: SAS Institute Inc. The SAS support page for these macros is available at http://support.sas.com/kb/24/980.html#ref.
Parametric:
- Format the data for interval-censored survival analysis as follows: a subject's data for the pair of columns in the dataset (time1, time2) is (t_e, t_e) if the event time t_e is known exactly; (t_l, .) if right censored (where t_l is the censoring time); and (t_l, t_u) if interval censored (where t_l is the lower and t_u is the upper bound of the interval). Note that it is critical for the implementation of parametric survival regression for a range of survival distributions (including the Weibull) that there are no zeros in the survival time data, even if the zero data points only correspond to the lower value of a censored interval. This is especially critical for proc lifereg as the zero observations are simply removed from the dataset without warning.
Implement the interval-censored parametric model using proc lifereg as follows (in this example the Weibull accelerated failure time model is given):
proc lifereg data=YourData;
class group ;
model (time1, time2) = group covariates / dist=Weibull;
run;
- (For Weibull accelerated failure time model only) Generate hazard ratio estimates and confidence intervals from the parameters of the fitted Weibull accelerated failure time model: take the exponential of the model coefficient estimates and confidence intervals and then divide these by the fitted scale parameter.
Ordered logistic regression:
- Run proc logistic with link=cloglog
General Proposed Analysis Strategy For Interval-Censored Data
- Are the intervals defined by a small discrete set of times that are consistent for all subjects? If so, the logistic regression based approach should be used.
- Is there a primary group comparison to be performed? If so, implement group test based on Kaplan-Meier-Turnbull based estimates
- Perform parametric accelerated failure time interval-censored analysis with Weibull-modeled survival times mirroring the (unadjusted) primary group comparison performed in 2). Note that the Weibull-model is recommended here primarily for interpretation reasons. The Weibull model for survival times is the only parametric distribution available that corresponds to a proportional hazards model and therefore is the only one for which interpretation in terms of hazard ratios can be made. Interpretation via hazard ratios is desirable because clinicians tend to have acquired some understanding of their meaning. Of course, if the proportional hazard assumption is violated then either alternative survival distributions or time-varying covariate methods must be used.
- Generate Hazard Ratios, confidence intervals and p-values from Weibull-model fitted parameter estimates.
- Compare parametric model fitted group survival curves to corresponding Kaplan-Meier-Turnbull curve estimates.
- Check whether proportional hazards or parametric modeling assumptions are reasonable via diagnostic probability and log-log plots. If assumptions are violated consider alternative parametric survival distributions, modeling non-proportionality with time-varying covariates, or semi-parametric baseline hazard models (see methods developed for this article for a template for how to include time-varying covariates in models with interval-censoring, late entry, and clustering).
- Repeat parametric interval-censored analysis with any additional covariates and potential confounding variable included in the model. Compare estimates with unadjusted models.