The partial log-likelihood function in the Cox proportional hazards model is given by $${}_{p}\ell(\beta) = \sum\limits_{i=1}^{K}X_i'\beta - \sum\limits_{i=1}^{K}\log\Big(\sum\limits_{l\in \mathscr{R}(t_i)}e^{X_l'\beta}\Big),$$ where $K$ is the number of observations for which an event was observed (out of $n$ observations in total, so $n-K$ observations were censored) and $\mathscr{R}(t_i)$ is the risk set at time $t_i$, defined as $\mathscr{R}(t_i) := \{X_j : t_j \geq t_i,\; j = 1, \dots, n \}$.
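To make the formula concrete, here is a naive sketch that follows it literally (an $O(n^2)$ double loop, purely for reference; it assumes a covariate matrix X and vectors time and status, with no pre-sorting required):

naive_cox_loglik <- function(beta, X, time, status) {
  eta <- as.vector(X %*% beta)   # linear predictors X_i' beta
  sum(vapply(which(status == 1), function(i) {
    risk <- time >= time[i]      # risk set R(t_i): still under observation at t_i
    eta[i] - log(sum(exp(eta[risk])))
  }, numeric(1)))
}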
I am trying to implement a function calculating this partial log-likelihood for a given $\beta$ vector and an input data set. I thought it would be convenient to first sort the data by the observed time, so that a higher row number indicates a higher survival time. For data in that form I have prepared an implementation for 2 explanatory variables:
full_cox_loglik <- function(beta1, beta2, x1, x2, censored){
  # data are sorted by time ascending, so after rev() the cumsum over
  # exp(linear predictor) gives the risk-set sum for each observation;
  # multiplying by rev(censored) keeps only the event terms
  lp <- beta1 * rev(x1) + beta2 * rev(x2)
  sum(rev(censored) * (lp - log(cumsum(exp(lp)))))
}
where beta1 is the coefficient for the x1 variable, beta2 is the coefficient for the x2 variable, and censored is a vector indicating whether the observation had an event ($1$) or was censored ($0$).
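On a small toy example (hypothetical data, already sorted by time, no ties), this vectorised version should agree with the naive reference sketched above:

set.seed(1)
tt <- sort(rexp(6))         # follow-up times, sorted ascending
x1 <- rbinom(6, 1, 0.5)
x2 <- rbinom(6, 1, 0.5)
st <- rbinom(6, 1, 0.7)     # event indicator (1 = event, 0 = censored)
full_cox_loglik(0.5, -0.3, x1, x2, st)
naive_cox_loglik(c(0.5, -0.3), cbind(x1, x2), tt, st)  # should match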
To be careful I have also prepared a second, longer implementation that works for any number of explanatory variables; it also takes data in the sorted format. It assumes that dCox is a data frame with the explanatory variables and a censoring column, beta is a vector of coefficients, and status_number is the index of the column holding the censoring information:
library(foreach)
library(magrittr)  # for the %>% pipe
partial_coxph_loglik <- function (dCox, beta, status_number) {
  n <- nrow(dCox)
  # part1_i = delta_i * x_i' beta (event terms of the linear predictor)
  foreach(i = 1:n) %dopar% {
    sum(dCox[i, status_number] * (dCox[i, -status_number] * beta))
  } %>% unlist -> part1
  # part2_i = exp(x_i' beta)
  foreach(i = 1:n) %dopar% {
    exp(sum(dCox[i, -status_number] * beta))
  } %>% unlist -> part2
  # part3_i = part1_i - delta_i * log(sum of part2 over the risk set i..n)
  foreach(i = 1:n) %dopar% {
    part1[i] - dCox[i, status_number] * log(sum(part2[i:n]))
  } %>% unlist -> part3
  sum(part3)
}
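Note that %dopar% needs a registered parallel backend, otherwise foreach issues a warning and falls back to sequential execution; a minimal setup, assuming the doParallel package is available:

library(doParallel)
registerDoParallel(cores = 2)  # register a backend so %dopar% actually runs in parallel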
To be more readable, the partial log-likelihood for that implementation is $${}_{p}\ell(\beta) = \sum\limits_{i=1}^{n} \text{part3}_i, \qquad \text{part3}_i = \text{part1}_i - \delta_i\log\Big(\sum\limits_{l=i}^{n}\text{part2}_l\Big),$$ where $\delta_i$ is the event indicator, $\text{part1}_i = \delta_i X_i'\beta$ and $\text{part2}_i = e^{X_i'\beta}$.
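With the toy data from the sketch above, the two implementations should return the same value, which makes for a quick consistency check:

toy <- data.frame(x1 = x1, x2 = x2, status = st)
full_cox_loglik(0.5, -0.3, toy$x1, toy$x2, toy$status)
partial_coxph_loglik(toy, c(0.5, -0.3), 3)  # status sits in column 3 here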
For this implementation I have tried to calculate the values of the partial log-likelihood for data that were generated from the true parameters beta = c(2,2). But the results tell me that the maximum of the partial log-likelihood is not at the point beta = c(2,2) but far from it.
One can simulate survival data for the Cox proportional hazards model, with event times coming from a Weibull distribution, using the method explained here: https://stats.stackexchange.com/a/135129/49793 . A similar implementation based on that solution is below (it works for 2 explanatory variables):
set.seed(456)
dataCox <- function(N, lambda, rho, x, beta, censRate){
  # real Weibull event times
  u <- runif(N)
  Treal <- (- log(u) / (lambda * exp(x %*% beta)))^(1 / rho)
  # censoring times
  Censoring <- rexp(N, censRate)
  # follow-up times and event indicators
  time <- pmin(Treal, Censoring)
  status <- as.numeric(Treal <= Censoring)
  # data set
  data.frame(id = 1:N, time = time, status = status, x = x)
}
x <- matrix(sample(0:1, size = 2000, replace = TRUE), ncol = 2)
dataCox(10^3, lambda = 5, rho = 1.5, x, beta = c(2,2), censRate = 0.2) -> dCox
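One way to validate both the simulated data and a hand-written log-likelihood is the survival package (a sketch; to the best of my knowledge coxph reports in loglik[1] the partial log-likelihood at the initial coefficient values, so init combined with iter.max = 0 evaluates it at an arbitrary beta):

library(survival)
fit <- coxph(Surv(time, status) ~ x.1 + x.2, data = dCox)
coef(fit)         # should land reasonably close to the true beta = c(2, 2)
at22 <- coxph(Surv(time, status) ~ x.1 + x.2, data = dCox,
              init = c(2, 2), control = coxph.control(iter.max = 0))
at22$loglik[1]    # partial log-likelihood at beta = c(2, 2)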
So for this implementation and the simulated data I have calculated the values of the partial log-likelihood for beta from c(0,0) to c(2,2) (along the diagonal beta1 = beta2) and received the following results:
library(dplyr)
dCox %>%
  dplyr::arrange(time) -> dCoxArr
beta1 <- seq(0, 2, 0.05)
beta2 <- seq(0, 2, 0.05)
res <- numeric(length(beta1))
for (i in seq_along(beta1)) {
  res[i] <- full_cox_loglik(beta1[i], beta2[i], dCoxArr$x.1,
                            dCoxArr$x.2, dCoxArr$status)
}
library(ggplot2)
qplot(beta1, res)
res2 <- numeric(length(beta1))
for (i in seq_along(beta1)) {
  res2[i] <- partial_coxph_loglik(dCoxArr[, c(4, 5, 3)],
                                  c(beta1[i], beta2[i]), 3)
  cat("\r", i, "\r")  # simple progress indicator
}
qplot(beta1, res2)
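Since the two loops above only walk the diagonal beta1 = beta2, one can also scan the whole 2-D grid to see where the maximum actually lies (a sketch reusing full_cox_loglik):

grid <- expand.grid(b1 = seq(0, 3, 0.1), b2 = seq(0, 3, 0.1))
grid$loglik <- mapply(function(b1, b2)
  full_cox_loglik(b1, b2, dCoxArr$x.1, dCoxArr$x.2, dCoxArr$status),
  grid$b1, grid$b2)
ggplot(grid, aes(b1, b2, z = loglik)) + geom_contour()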
It does not look like the maximum of the partial log-likelihood function is at the point beta = c(2,2) from which I generated the data. So here is my question: where did I make a mistake? In the data generation? In the partial log-likelihood implementation? Or somewhere else?



Comments:

– AdamO (Dec 17 '15 at 23:47): The partial_log_likelihood function looks pretty sloppy. Have you tried writing it in a simple sapply loop instead? One thing I notice in your derivation is that you use both failure and censoring times to index risk sets. Only the failure times index risk sets. If an individual is censored between two failure times, the arbitrary baseline hazard means you can't conclude anything extra except that they lived up to the most recent failure before they were censored.

– Marcin (Dec 17 '15 at 23:59): "One thing I notice in your derivation is that you use both failure and censoring times to index risk sets. Only the failure times index risk sets" – but since the sum is over $\sum_{i=1}^{K}$, where $K$ is the number of observed events, I only add a value to the partial log-likelihood for observations with events. For censored observations I just multiply those parts by zero, which can be seen in the computation of part3: part1[i] - dCox[i, status_number]*(log(sum(part2[i:n]))).

– Marcin (Dec 18 '15 at 00:15): … survival::coxph for each iteration step?

– AdamO (Dec 18 '15 at 17:17): … iter.max=1 in the control and inputting the 1-step beta estimators as arguments to init as I did before.