Bias in parameter estimates for Cox proportional hazard model when covariates are collinear

Question

For linear regression, if $y$ actually depends on two positively correlated covariates $x_1$ and $x_2$ (call this the true model) and we include only one covariate, say $x_1$, in the regression (call this the working model), its coefficient $\beta_1$ will be overestimated, assuming $\beta_2 > 0$. This makes intuitive sense, because the fitted coefficient now represents the effect of both $x_1$ and $x_2$. For standardized covariates one can in fact derive that $\tilde{\beta}_1 = \beta_1 + \rho \beta_2$, where $\tilde{\beta}_1$ is the apparent coefficient of $x_1$ in the working model, $\beta_1$ and $\beta_2$ are the actual coefficients of $x_1$ and $x_2$ in the true model, and $\rho$ is the correlation coefficient between $x_1$ and $x_2$. (The derivation is appended at the end.)

Now I am interested in the same question for the Cox proportional hazards model. To my surprise, I observe that when the true model has positively correlated $x_1$ and $x_2$, and the working model has only $x_1$, $\beta_1$ is in fact underestimated, at least when estimated using Cox's partial likelihood method. Here is my simulation code with some explanatory comments.

library(mvtnorm)
library(survival)

n <- 100000
set.seed(0)

# set the correlation coefficient to be 0.5
sigma <- matrix(c(1,0.5,0.5,1), ncol=2)
X <- rmvnorm(n=n, mean=c(0,0), sigma=sigma)

x1 <- X[,1]
x2 <- X[,2]

b1 <- 1
b2 <- 3

# relative hazards
relhazs <- exp(b1*x1 + b2*x2)

# event times
# assume baseline hazard is a constant function at 1
# so the survival times are simply exp distributed
etimes <- rexp(n, relhazs)

# assume no censorship for simplicity
status <- rep(1, n)

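dat <- data.frame(id=1:n,
                  time=etimes,
                  status=status,
                  x1=x1,
                  x2=x2)

# true model: both covariates
coxph(Surv(time, status) ~ x1 + x2, data = dat,
      control = coxph.control(timefix = FALSE))

# working model: x1 only
coxph(Surv(time, status) ~ x1, data = dat,
      control = coxph.control(timefix = FALSE))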

Output:

Call:
coxph(formula = Surv(time, status) ~ x1 + x2, data = dat, control = coxph.control(timefix = FALSE))

        coef exp(coef)  se(coef)     z      p
x1  1.004289  2.729965  0.004424 227.0 <2e-16
x2  2.991976 19.925012  0.008271 361.7 <2e-16

Likelihood ratio test=236231  on 2 df, p=< 2.2e-16
n= 100000, number of events= 1e+05

Call:
coxph(formula = Surv(time, status) ~ x1, data = dat, control = coxph.control(timefix = FALSE))

       coef exp(coef) se(coef)     z      p
x1 0.837905  2.311519 0.003825 219.1 <2e-16

Likelihood ratio test=49198  on 1 df, p=< 2.2e-16
n= 100000, number of events= 1e+05

My questions:

1. Why the underestimation?
2. Can we possibly derive some analytical relationship between $\tilde{\beta}_1$, $\beta_1$, $\beta_2$ and $\rho$ in this case, just like what we did for linear regression, even for some simple baseline distribution such as $\text{Exp}(1)$ in my simulation?
3. If I estimate the parameters using a parametric model, then $\beta_1$ is indeed overestimated in magnitude (please see below; note that survreg reports coefficients with the opposite sign to coxph). Why the difference between Cox's semi-parametric partial-likelihood-based estimation and parametric full-likelihood-based estimation?

Call:
survreg(formula = Surv(time, status) ~ x1 + x2 + 0, data = dat, 
    dist = "exp")

Coefficients:
       x1        x2 
-1.004653 -2.993058 

Scale fixed at 1 

Loglik(model)= -100274.9   Loglik(intercept only)= -655669.9
    Chisq= 1110790 on 1 degrees of freedom, p= <2e-16 
n= 100000

Call:
survreg(formula = Surv(time, status) ~ x1 + 0, data = dat, dist = "exp")

Coefficients:
      x1 
-2.50526 

Scale fixed at 1 

Loglik(model)= -1302576   Loglik(intercept only)= -655669.9
n= 100000

Appendix:

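For linear regression, suppose the true model is $y=\beta_1 x_1 + \beta_2 x_2 + \epsilon$, and the working model is $y=\tilde{\beta}_1 x_1 + \tilde{\epsilon}$. Without loss of generality, assume $x_1$ and $x_2$ are standardized, so that $x_2=\rho x_1 + \eta$, where $\rho$ is the correlation coefficient and $\eta$ is uncorrelated with $x_1$. Substituting this into the true model gives $y = (\beta_1 + \rho \beta_2) x_1 + (\beta_2 \eta + \epsilon)$. The term in the second parentheses is uncorrelated with $x_1$, so it plays the role of $\tilde{\epsilon}$, and matching coefficients with the working model gives $\tilde{\beta}_1 = \beta_1 + \rho \beta_2$.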
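
The omitted-variable formula above can be checked numerically. The following is a minimal sketch in base R, assuming standardized normal covariates and the same $\beta_1 = 1$, $\beta_2 = 3$, $\rho = 0.5$ as in the Cox simulation; the names `fit_true` and `fit_working` are chosen here for illustration.

```r
# Numerical check of the omitted-variable formula for linear regression.
set.seed(1)
n   <- 100000
rho <- 0.5
b1  <- 1
b2  <- 3

# Standardized covariates with correlation rho:
# x2 = rho * x1 + eta, where eta is independent of x1.
x1 <- rnorm(n)
x2 <- rho * x1 + sqrt(1 - rho^2) * rnorm(n)
y  <- b1 * x1 + b2 * x2 + rnorm(n)

fit_true    <- lm(y ~ x1 + x2)   # true model
fit_working <- lm(y ~ x1)        # working model, x2 omitted

coef(fit_true)[c("x1", "x2")]    # should be close to b1 and b2
coef(fit_working)["x1"]          # should be close to b1 + rho * b2
```

With these settings the working-model slope should land near $1 + 0.5 \times 3 = 2.5$, up to sampling noise.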

You might be interested in this post which cites two papers on the non-parametric estimation of the HR. The general relationship is quite complex, but basically you integrate over the distribution of your X — AdamO, Mar 19 '19 at 18:03
Just a note: survreg is actually not estimating the proportional hazards, but an AFT model. To illustrate side by side, see library(icenReg); fit_ph = ic_par(cbind(time, time) ~ x1, data = dat, model = "ph");fit_ph; fit_aft = ic_par(cbind(time, time) ~ x1, data = dat, model = "aft");fit_aft at the end of your code. Note that setting model = "ph" gives nearly identical coefficients as coxph! — Cliff AB, Mar 19 '19 at 18:26
I can write a full answer at some point, but the key is that if you think of the effect of x2 being absorbed into the baseline distribution, you no longer have an exponential distribution, meaning you no longer have constant hazards and no longer have proportional hazards in x1. However, your simulation properly sits inside an AFT model (although you've misspecified the baseline distribution, of course). – Cliff AB, Mar 19 '19 at 18:34
@CliffAB Since Weibull distribution can fit into both the CPH framework and the AFT framework, and exponential distribution is a type of Weibull, is it true that in this case the results of survreg can also be interpreted in the CPH context? — Lei Huang, Mar 19 '19 at 19:53
@CliffAB Thanks for the hint. That makes sense. It would be great if you can write an answer with more details when you have time. — Lei Huang, Mar 19 '19 at 19:54
As for the relation between PH and AFT with a Weibull baseline, they provide equivalent fits up to a change of variables. See the appendix from my icenReg paper for a quick derivation. — Cliff AB, Mar 19 '19 at 20:04
One reference on the inherent omitted-variable attenuation bias in Cox regressions is: J. Bretagnolle, C. Huber-Carol, Scand. J. Statistics 15 (1988), 125-138. Logistic regression has a similar issue, with a nice analytic explanation for the related probit model here – EdM, Mar 19 '19 at 20:40
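
The comments above point out that survreg fits an AFT model rather than a proportional hazards model. The following sketches why the working-model survreg coefficient in the question lands near $-(\beta_1 + \rho \beta_2) = -2.5$. It assumes the simulation setup of the question (unit exponential baseline, no censoring, mean-zero bivariate normal covariates); the symbols $W$, $\eta$ and $\gamma$ are introduced here for illustration only.

$$
\begin{aligned}
\text{true model: } & T = W\, e^{-(\beta_1 x_1 + \beta_2 x_2)}, \quad W \sim \text{Exp}(1), \qquad x_2 = \rho x_1 + \eta, \quad \eta \text{ independent of } x_1, \\
\text{working model: } & T \mid x_1 \sim \text{Exp}\big(\text{rate} = e^{-\gamma x_1}\big), \\
\text{population score equation: } & \mathbb{E}\big[\, x_1 \,\big(T e^{-\gamma x_1} - 1\big) \big] = 0, \\
\text{where } & T e^{-\gamma x_1} = W\, e^{-\beta_2 \eta}\, e^{-(\beta_1 + \rho \beta_2 + \gamma)\, x_1}.
\end{aligned}
$$

At $\gamma = -(\beta_1 + \rho \beta_2)$ the product $T e^{-\gamma x_1}$ no longer depends on $x_1$, so the expectation factorizes and vanishes because $\mathbb{E}[x_1] = 0$. Under these assumptions the exponential AFT fit therefore reproduces the linear-regression result (with the sign flipped by survreg's parametrization), which is consistent with the value $-2.50526$ in the output.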

score 4 · Answer 1 · answered Mar 19 '19 at 17:52

The Cox model effectively uses weighted least squares, with weights that are functions of the data and of $\hat{\beta}$. It also has the issue that the hazard ratio is not collapsible, meaning that omitting even a completely orthogonal predictor that is related to the outcome will cause a distortion in the remaining $\beta$s.
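
One way to see the non-collapsibility is to write out the hazard that the working model actually targets. This is a sketch, assuming the omitted covariate $x_2$ is independent of $x_1$ and the full model is a proportional hazards model with baseline hazard $\lambda_0(t)$ and cumulative baseline hazard $\Lambda_0(t)$ (notation introduced here).

$$
\begin{aligned}
S(t \mid x_1) &= \mathbb{E}_{x_2}\!\left[ \exp\!\big(-\Lambda_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2}\big) \right], \\
\lambda(t \mid x_1) &= -\frac{d}{dt} \log S(t \mid x_1)
= \lambda_0(t)\, e^{\beta_1 x_1}\,
\frac{\mathbb{E}_{x_2}\!\left[ e^{\beta_2 x_2} \exp\!\big(-\Lambda_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2}\big) \right]}
     {\mathbb{E}_{x_2}\!\left[ \exp\!\big(-\Lambda_0(t)\, e^{\beta_1 x_1 + \beta_2 x_2}\big) \right]} \\
&= \lambda_0(t)\, e^{\beta_1 x_1}\, \mathbb{E}\!\left[ e^{\beta_2 x_2} \mid T \ge t,\ x_1 \right].
\end{aligned}
$$

At $t = 0$ the last factor does not depend on $x_1$, so the hazard ratio is $e^{\beta_1}$. For $t > 0$ the survivors with large $\beta_1 x_1$ tend to have small $\beta_2 x_2$, so the last factor shrinks as $\beta_1 x_1$ grows. The marginal hazards are then no longer proportional, and their ratio is pulled toward 1.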

Frank Harrell

Many thanks! Any reference on Cox model's estimation effectively as weighted least squares? Or on this topic of the bias in parameter estimates of Cox model in general? – Lei Huang Mar 19 '19 at 18:03
If you look at the Newton-Raphson algorithm you'll see how it uses the information matrix to update the parameter value guesses, and a standard text will give you the formula for the information matrix where you'll see the weights coming from risk sets. This is also discussed in the literature as iteratively reweighted least squares. – Frank Harrell Mar 20 '19 at 11:40
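
For reference, the quantities mentioned in the comment above can be written out as follows. This is a sketch assuming no tied event times, with $\delta_i$ the event indicator, $R(t_i)$ the risk set at time $t_i$, and $w_j = e^{x_j^\top \beta}$ the risk-set weights (notation introduced here).

$$
\begin{aligned}
\bar{x}(\beta, t_i) &= \frac{\sum_{j \in R(t_i)} w_j\, x_j}{\sum_{j \in R(t_i)} w_j}, \\
U(\beta) &= \sum_i \delta_i \left[ x_i - \bar{x}(\beta, t_i) \right], \\
I(\beta) &= \sum_i \delta_i \left[ \frac{\sum_{j \in R(t_i)} w_j\, x_j x_j^\top}{\sum_{j \in R(t_i)} w_j} - \bar{x}(\beta, t_i)\, \bar{x}(\beta, t_i)^\top \right], \\
\beta^{(k+1)} &= \beta^{(k)} + I\big(\beta^{(k)}\big)^{-1}\, U\big(\beta^{(k)}\big).
\end{aligned}
$$

Each term of $I(\beta)$ is a weighted covariance matrix of the covariates over one risk set. Because the weights $w_j$ depend on the current $\beta$, they are recomputed at every Newton-Raphson step.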

score 2 · Answer 2 · answered Mar 21 '19 at 02:00

The situation is worse, in fact, and I think more comical. Try setting the correlation coefficient to 0, and run the same simulation. Surely, we should have less bias, maybe even close to no bias, right? No! We actually observe even more bias: the estimate of $\beta_1$, whose true value is 1, drops from ~0.84 at $\rho = 0.5$ to ~0.28 at $\rho = 0$. @Frank Harrell mentioned the non-collapsibility of the hazard ratio, but there is another bias occurring, a causal bias (this may be the same thing as the non-collapsibility, seen from an alternative perspective).
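
The zero-correlation experiment can be sketched as follows. This is not the answerer's original code: it reuses the question's settings ($\beta_1 = 1$, $\beta_2 = 3$, unit exponential baseline, no censoring) with a smaller sample size to keep the run short, and the names `dat0`, `fit_true` and `fit_working` are chosen here for illustration.

```r
library(survival)

set.seed(0)
n  <- 20000
b1 <- 1
b2 <- 3

# Independent covariates (correlation 0), unlike the 0.5 used in the question.
x1 <- rnorm(n)
x2 <- rnorm(n)

# Constant baseline hazard of 1 and no censoring, as in the question.
time   <- rexp(n, rate = exp(b1 * x1 + b2 * x2))
status <- rep(1, n)
dat0   <- data.frame(time = time, status = status, x1 = x1, x2 = x2)

# Same control setting as in the question's coxph calls.
ctrl <- coxph.control(timefix = FALSE)

fit_true    <- coxph(Surv(time, status) ~ x1 + x2, data = dat0, control = ctrl)
fit_working <- coxph(Surv(time, status) ~ x1, data = dat0, control = ctrl)

coef(fit_true)      # both coefficients should be near b1 and b2
coef(fit_working)   # the x1 coefficient is attenuated toward 0
```

Even though `x2` carries no information about `x1` here, dropping it still changes the `x1` coefficient substantially.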

This causal bias occurs because the likelihood for the Cox model computes survival at time $t+1$ given survival at time $t$. This is fine if we can block all back-door paths, but by omitting any risk factor (even one that is not a confounder) we fail to block them. I explain why below, using a causal DAG.

[Figure: causal DAG over $X_1$, $X_2$, $S_t$ and $S_{t+1}$]

More specifically, when we wish to compute the effect of $X_1$ on $S_{t+1}$, we (implicitly) condition on $S_t$. In the DAG, $S_t$ is a collider, and this should worry us: if there is another variable, such as $X_2$, that also affects survival, then conditioning on $S_t$ opens up the back-door path $S_{t+1} \leftarrow X_2 \rightarrow S_t \leftarrow X_1$. So, generally, for any $t > 0$, the effect of $X_1$ is biased in an unpredictable way by any and all other risk factors.
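
The collider effect described above can be illustrated with a short simulation. This is a sketch under assumed settings (independent standard normal covariates, the question's coefficients 1 and 3, unit exponential baseline); conditioning on survival past the median event time stands in for conditioning on $S_t$, and the variable names are hypothetical.

```r
set.seed(0)
n  <- 100000
x1 <- rnorm(n)
x2 <- rnorm(n)   # independent of x1 in the full cohort

# Event times with constant baseline hazard 1 and log-hazard x1 + 3 * x2.
time <- rexp(n, rate = exp(1 * x1 + 3 * x2))

# Condition on survival past the median event time.
alive <- time > median(time)

cor_all       <- cor(x1, x2)                 # near 0 in the full cohort
cor_survivors <- cor(x1[alive], x2[alive])   # negative among survivors
c(all = cor_all, survivors = cor_survivors)
```

Among survivors, a high `x1` tends to go with a low `x2`, so comparing survivors across levels of `x1` also compares them across levels of the omitted `x2`.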

A terrific paper on this, and where I learned this fact, is here: https://core.ac.uk/download/pdf/144149054.pdf

I don't see the relevance here. The successive conditional probabilities are conditionally independent. — Frank Harrell, Mar 21 '19 at 13:25
