\documentclass{article}
\usepackage{amsmath}
\usepackage{Sweave}
\addtolength{\textwidth}{1in}
\addtolength{\oddsidemargin}{-.5in}
\setlength{\evensidemargin}{\oddsidemargin}
%\VignetteIndexEntry{Using Time Dependent Covariates}
\title{Using Time Dependent Covariates and Time Dependent Coefficients
in the Cox Model}
\author{Terry Therneau \and Cindy Crowson\\ Mayo Clinic}
\begin{document}
\maketitle
\SweaveOpts{keep.source=TRUE}
<>=
options(width=60, continue=" ")
makefig <- function(file, top=1, right=1, left=4) {
pdf(file, width=9.5, height=7, pointsize=18)
par(mar=c(4, left, top, right) +.1)
}
library(survival)
@
\section{Introduction}
One of the strengths of the Cox model is its ability to encompass covariates
that change over time, due to the theoretical foundation in martigales.
A \emph{martingale} (original definition) is a betting strategy in games of
chance. One of the simplest and best known is doubling the bet each time
you lose. For instance consider the following game of roulette:
\begin{equation*}
\begin{tabular}{rcc|c}
Bet & Outcome & Win & Running total \\ \hline
R \$1 & Red & 2 & 1 \\ &&&\\
R \$1 & Black & 0 & 0 \\
R \$2 & Black & 0 & -2 \\
B \$4 & Red & 0 & -6 \\
R \$8 & Black & 0 & -14 \\
B \$16& Black & 32 & 2 \\ &&&\\
B \$1 & Red & 0 & 1 \\
B \$2 & Black & 4 & 3 \\&&&\\
\vdots & \vdots & \vdots &\vdots
\end{tabular}
\end{equation*}
At the end of each cycle of bets the player is another \$1 ahead.
The problem is that a modest sequence of losses will exhaust their stake.
The rule for time dependent covariates in a Cox model is simple
and essentially the same as that for gambling: you cannot look into
the future.
A covariate may change in any way based on past data or outcomes, but it
may not reach ``forward'' in time.
One of the more well known examples of this error is analysis by response:
at the end of a trial a survival curve is made comparing those who had an
early response to treatment (shrinkage of tumor, lowering of cholesterol,
or whatever), and it discovered that response predicts survival.
The problem arises because subjects are classified as responders or
non-responders from the beginning of the study, i.e., they are placed into
group A or B before the response has occurred.
As a consequence, any early deaths that occur before response can be
assessed will be assigned to the non-responder group, even deaths that have
nothing to do with the condition under study.
There are many variations on the error: interpolation of the values of a
laboratory test linearly between observation times, removing subjects who do
not finish the treatment plan, imputing the date of an adverse event midway
between observation times, etc.
All of these are similar to running a red light in your car:
disaster is not guarranteed --- but it is likely.
The most common way to encode time-dependent covariates is to use the
(start, stop] form of the model.
<>=
fit <- coxph(Surv(time1, time2, status) ~ age + creatinine,
data=mydata)
@
In this data set a patient might have the following observations
\begin{center}
\begin{tabular}{ccccccc}
subject & time1 & time2 & status & age & creatinine & \ldots \\ \hline
1 & 0 & 15 & 0 & 25 & 1.3 \\
1 & 15& 46 & 0 & 25 & 1.5 \\
1 & 46& 73 & 0 & 25 & 1.4 \\
1 & 73& 100& 1 & 25 & 1.6 \\
\end{tabular}
\end{center}
In this case the variable \emph{age} = age at entry to the study stays the
same from line to line, while the value of creatinine varies and is treated as
1.3 over the interval $(0, 15]$, 1.5 over $(15, 46]$, etc. The intervals are
open on the left and closed on the right, which means that the creatinine
is taken to be 1.3 on day 15.
The status variable describes whether or not each interval ended in an event.
One commmon question with this data setup is whether we need to worry about
correlated data, since a given subject has multiple observations.
The answer is no, we do not. The reason is that this representation is
simply a bookkeeping trick; the likelihood equations at any time point
use only one copy of any subject, the program picks out the correct
row of data at any given time.
There are two exceptions to this rule, in which case the cluster variance
is necessary:
\begin{itemize}
\item When subjects have multiple events.
\item When a subject appears in overlapping intervals. This however is
almost always a data error, since it corresponds to two copies of the
subject being present at the same time, e.g., they could meet themselves
on the sidewalk.
\end{itemize}
\section{Examples}
\subsection{Multiple events}
Chronic granulomatous disease (CGD) is a heterogenous group of uncommon
inherited disorders characterized by recurrent pyogenic infections that
usually begin early in life and may lead to death in childhood. Interferon
gamma is a principal macrophage-activating factor shown to partially correct
the metabolic defect in phagocytes.
It was hypothesized that treatment with interferon might reduce
the frequency of serious
infections in patients with CGD.
In 1986, Genentech, Inc. conducted a
randomized, double-blind, placebo-controlled trial in 128 CGD patients who
received Genentech's humanized interferon gamma (rIFN-g) or placebo three %'
times daily for a year. The primary endpoint of the study was the time to
the first serious infection. However, data were collected on all serious
infections until the end of followup,
which occurred before day 400 for most
patients. Thirty of the 65 patients in the placebo group and 14 of the 63
patients in the rIFN-g group had at least one serious infection.
The total number of infections was 56 and 20 in the placebo and treatment
groups, respectively.
One patient was taken off on the day of his last infection; all others
have some followup after their last episode.
Below are the first 10 observations, but with the listing truncatated
beyond the fourth infection. Subject 2 has 7 infections, no one in the
study has more.
\small\begin{verbatim}
1 204 082888 1 2 12 147.0 62.0 2 2 2 2 414 219 373
2 204 082888 0 1 15 159.0 47.5 2 2 1 2 439 8 26 152 241 249 322 350
3 204 082988 1 1 19 171.0 72.7 1 2 1 2 382
4 204 091388 1 1 12 142.0 34.0 1 2 1 2 388
5 238 092888 0 1 17 162.5 52.7 1 2 1 1 383 246 253
6 245 093088 1 2 44 153.3 45.0 2 2 2 2 364
7 245 093088 0 1 22 175.0 59.7 1 2 1 2 364 292
8 245 093088 1 1 7 111.0 17.4 1 2 1 2 363
9 238 100488 0 1 27 176.0 82.8 2 2 1 1 349 294
10 238 100488 1 1 5 113.0 19.5 1 2 1 1 371
\end{verbatim}
\normalsize
<<>>=
cgdname <- c("id", "center", "random.dt", "trt", "sex", "age",
"height", "weight", "inheritance", "steroids",
"antibiotic", "inst", "futime", paste("e", 1:7, sep=''))
cgd1 <- read.table('cgd.dat', header=F, fill=T, col.names=cgdname,
colClasses=c("integer", "integer", "character",
rep("integer", 3), rep("numeric",2),
rep("integer", 12)))
cgd1$ninfect <- rowSums(!is.na(as.matrix(cgd1[,14:20])))
cgd1$random.dt <- as.Date(cgd1$random.dt, format="%m%d%y")
cgd1$sex <- factor(cgd1$sex, labels=c("M", "F"))
cgd1$inst <- factor(cgd1$inst, labels=c("NIH", "US-Other",
"Amsterdam", "Europe-Other"))
temp <- apply(as.matrix(cgd1[,13:20]), 1, function(x) {
z <- as.vector(x[!is.na(x)])
if (length(z)==1) cbind(0, z, 0)
else {
temp <- cbind(c(0, z[-1]),
c(z[-1], z[1]),
c(rep(1, length(z)-1), 0))
if (z[1]== z[length(z)]) temp[-nrow(temp),]
else temp
}})
index <- rep(1:nrow(cgd1), unlist(lapply(temp, nrow)))
cgd2 <- data.frame( cgd1[index, 1:12],
time1= unlist(lapply(temp, function(x) x[,1])),
time2= unlist(lapply(temp, function(x) x[,2])),
event= unlist(lapply(temp, function(x) x[,3])),
enum = unlist(lapply(temp, function(x) 1:nrow(x))))
cfit <- coxph(Surv(time1, time2, event) ~ trt + sex + age +
inheritance + cluster(id), data=cgd2)
@
The logic for creating the time variables is
\begin{itemize}
\item If a subject has no events there is a single interval from 0 to
last follow-up, with a status of 0.
\item If there are events
\begin{itemize}
\item There is an interval for each event (0, event1], (event1, event2],
etc. each with status =1.
\item If the follow-up time exceeds the last event there will be a
final interval with status =0.
\end{itemize}
\end{itemize}
The result of the \texttt{apply} function above will be a list with one
item per patient, each item being a 3 column matrix with time1,
time2, and status.
The as.vector command removes names, which can speed the operation.
The resulting data frame \texttt{cgd2} is in the proper format for the
analysis.
\subsection{Changing lab tests}
To be filled in.
\subsection{Predictable time-dependent covariates}
Occasionaly one has a time-dependent covariate whose values in the future
are predictable.
The most obvious of these is patient age, occasionally this may also be
true for the cumulative dose of a drug.
If age is entered as a linear term in the model, then the effect of changing
age can be ignored in a Cox model, due to the structure of the partial
likelihood. Assume that subject $i$ has an event at time $t_i$, with other
subject $j \in R_i$ at risk at that time, with $a$ denoting age.
The partial likelihood term is
\begin{equation*}
\frac{e^{\beta * a_i}}{\sum_{j \in R_i} e^{\beta* a_j}} =
\frac{e^{\beta * (a_i + t_i)}}{\sum_{j \in R_i} e^{\beta* (a_j + t_i)}}
\end{equation*}
We see that using time-dependent age (the right hand version) or age at
baseline (left hand), the partial likelihood term is identical since
$\exp(\beta t_i)$ cancels out of the fraction.
Howevever, if the effect of age on risk is \emph{non-linear}, this
cancellation does not occur.
Since age changes continuously, we would in theory need a very large
(start, stop] data set to completely capture the effect --- an interval
per day to match the usual resolution for death times.
In practice this level of resolution is not necessary; though we all
grow older, risk does not increase so rapidly that we need to know
out age to the day!
For most medical applications year is sufficient, but this still leads
to a large data set.
One useful way to generate this data set is through use of the \texttt{pyears}
function.
The following example uses data on rehospitalization for cohort of
rheumatoid arthritis
patients who also have conjestive heart failure (CHF)\cite{Nicola05}.
The variables are
\begin{itemize}
\item patid: patient identifier
\item agechf: age at onset of CHF
\item yearchf calendar year of CHF, relative to the start of the study
\item startday, stopday: an interval of risk
\item hospevt: =1 if the interval ends with a hospitalization
\item prevhosp: number of prior hosptializations
\item duration: duration of RA prior to CHF
\item male sex: 1 for male, 0 for female
\end{itemize}
The age and duration variables have been rounded to .1 year to
maintain patient privacy.
<<>>=
load('raheart.rda')
age2 <- tcut(raheart$agechf*365.25, 0:110* 365.25, labels=0:109)
rowid <- 1:nrow(raheart)
pfit <- pyears(Surv(startday, stopday, hospevt) ~ age2 + rowid,
data=raheart, data.frame=TRUE, scale=1)
print(pfit$offtable)
pdata <- pfit$data
print(pdata[1:6,])
@
The \texttt{tcut} function attaches a set of cutpoints to the starting age
for each subject, it's primary job is to mark the variable as time-increasing%'
for the \texttt{pyears} function.
In the \texttt{pyears} call we set scale=1 to prevent the age intervals from
being rescaled to years, this is not critical.
Printing out the value of offtable is important, however. One of them
most common mistakes in using \texttt{pyears} is mismatched scales, for
instance if the age were in years and the follow-up time in days, and
a common result of that error is to have follow-up time that fits into
none of the categories.
This will give a large amount of time that is outside the boundaries of the
table.
In the resulting data frame the first observation has 7 days of follow-up,
exactly as in the starting data.
The second observation has been broken into 4 rows, 109.6 days at age 92,
then a year each at age 93 and 94, and a final 234.9 days at age 95
ending with a hospitalization.
(The odd fractions of a day like .575 are a consequence of rounding the
age values to 1 digit.)
Now we combine this with the original data set using the same indexing
trick found in the first example.
We also need a variable containing the end time for the prior row of
each subject and zero for the first row of each subject, which is
\texttt{lagtime} below.
We then fit two models. The first looks at the effect of age at
diagnosis of CHF, the second at the effect of current age.
<<>>=
index <- as.integer(pdata$rowid)
lagtime <- c(0, pdata$pyears[-nrow(pdata)])
lagtime[1+ which(diff(index)==0)] <- 0 #starts at 0 for each subject
temp <- raheart$startday[index] + lagtime #start of each new interval
data2 <- data.frame(raheart[index,],
time1= temp,
time2= temp + pdata$pyears,
event= pdata$event,
age2= 1+ as.numeric(pdata$age2) )
afit1 <- coxph(Surv(startday, stopday, hospevt) ~ male + pspline(agechf),
data=raheart)
afit2 <- coxph(Surv(time1, time2, event) ~ male + pspline(age2), data2)
#termplot(afit1, terms=2, se=TRUE, xlab="Age at Diagnosis of CHF")
#termplot(afit2, terms=2, se=TRUE, xlab="Current Age")
table(with(raheart, tapply(hospevt, patid, sum)))
@
In this particular case the two fits are quite similar.
In retrospect this perhaps should have been expected: the mean age
for onset of CHF in this group is 75 years, which does leave a lot
of time for aging.
(This analysis is very preliminary, however. As shown by the last line
above there are a few patients with a very large number of admissions,
sometimes referred to as ``entering a revolving door'' near the end of
their disease. These have an untoward influence on the fit.)
\subsection{Predictable covariates, method 2}
Another method to create a time-changing covariate is to use the
\emph{time-transform} feature of coxph.
<<>>=
afit2b <- coxph(Surv(startday, stopday, hospevt) ~ male + tt(agechf),
data=raheart,
tt=function(x, t, ...) pspline(x + t/365.25))
afit2b
@
If there are one or more terms on the right hand side of the equation
marked with the tt() operator, the program will pre-compute the values
of that variable for each unique event time and strata combination.
A user-defined function is called with arguments of
\begin{itemize}
\item the covariate: whatever is inside the tt() call
\item the event time
\item the event number: if there are multiple strata and the same event
time occurs in two of them, they are treated separately
\item the weight for the observation, if the call used weights
\end{itemize}
There is a single call to the function with a very large $x$ vector,
it contains an element for each subject at risk at each event time.
If there are multiple tt() terms in the formula, then the tt argument
should be a list of functions with the requisite number of elements.
There are other interesting uses for the time-transform capability.
One example is O'Brien's logit-rank test procedure \cite{obrien78}.
He proposed replacing the covariate at each event time with a logit
transform of its ranks.
This removes the influence of any outliers in the predictor $x$.
For this case we ingore the event time argument and concentrate on the
groupings.
<<>>=
function(x, t, riskset, weights){
obrien <- function(x) {
r <- rank(x)
(r-.5)/(.5+length(r)-r)
}
unlist(tapply(x, riskset, obrien))
}
@
This relies on the fact that the input argments to tt() are ordered by the
event number or riskset.
This function is used as a default if no tt argument is present in the
coxph call, but there are tt terms in the model formula.
(Doing so allowed me to depreciate the survobrien function).
Another interesting useage is to replace the data by simple ranks, not
rescaled to 0--1.
<<>>=
function(x, t, riskset, weights)
unlist(tapply(x, riskset, rank))
@
The score statistic for this model is $(C-D)/2$, where $C$ and $D$ are the
number of concordant and discordant pairs, see the
survConcordance function.
The score statistic from this fit is then a test for significance of the
concordance statistics, and is in fact the basis for the standard error
reported by survConcordance.
The O'Brien test can be viewed as concordance statistic that gives equal %'
weight to each event time, whereas the standard concordance weights each
event proportionally to the size of the risk set.
(The Cox score statistic depends on the mean $x$ at each event time;
since ranks go from 1 to number at risk the mean also scales.)
Although handy, the computational impact of the tt argument should be
considered before using it.
The Cox model requires computation of a weighted mean and variance of
the covariates at each event time, a process that is inherently
$O(ndp^2)$ where $n$ = the sample size, $d$ = the number of events and $p$=
the number of covariates.
Much of the algorithmic effort in coxph() is to use updating methods for
the mean and variance matrices, reducing the compute time to $O((n+d) p^2)$.
When a tt term appears updating is not possible; for even moderate size
data sets the impact of $nd$ versus $n+d$ can be surprising.
The time-transform is a new addition and still has some rough edges.
At this moment the $x=T$ argment is needed to get proper residuals and
predicted values, and termplot is unable to properly reconstruct the
data to plot the spline fit.
Please communicate any concerns or interesting examples to the author.
\begin{thebibliography}{9}
\bibitem{Nicola05} Nicola PJ, Maradit-Kremers H, Roger VL, Jacobsen SJ,
Crowson CS, Ballman KV, Gabriel SE. ``The risk of Congestive Heart Failure
in Rheumatoid Arthritis: a Population-Based Study Over 46 Years.''
\emph{Arthritis Rheum} 52: 412--20, 2005.
\bibitem{obrien78} O'Brien, Peter, ``A non-parametric test for %'
association with censored data'', \emph{Biometrics} 34:243--250, 1978.
\end{thebibliography}
\end{document}