Why is the survival probability decreasing the more variables I use?

Question

I'm using coxph in combination with survfit and I observed that, e.g., when I use the following syntax

coxph(Surv(time, event) ~ 1, data = df)
coxph(Surv(time, event) ~ position, data = df)
coxph(Surv(time, event) ~ position + country, data = df)

that the survival functions move more and more leftwards as more and more variables are used, thus, decreasing in time a certain survival probability has achieved. I wonder why this as the data, especially time and event, do not change when considering more variables?!!

The data comprises ~4500 devices which have components built in described by levels of position. Furthermore, they are used across the world with 20-30 countries.

What you do produces an estimate of the survival function for average covariate values (see ?survfit.coxph). This will be different from the estimate of the average survival function you get when omitting some the covariates because of the non-linear dependency of survival on the covariates. See https://stats.stackexchange.com/a/270179/77222 — Jarle Tufto, Apr 01 '22 at 12:18
Thank you! But what is correct, then? I mean all those three functions above are valid(?) but when I'd have to estimate the machine's lifetimes, which one is it? What is right and what is wrong? — Ben, Apr 01 '22 at 12:34
As I said, with covariates included, you get estimates of survival of an average device rather than the average survival of a randomly chosen device. So the functions are not directly comparable. — Jarle Tufto, Apr 01 '22 at 12:40
but aren't above commands giving me average survivals, with and without covariates? I still wonder how they can be different, though the life span and events do not change at all. Is it just higher theory or could even I understand that? — Ben, Apr 01 '22 at 12:44
Can you provide the R code you're using to get the survival probability for each model? Cox models don't model survival probabilities without some extra work. My guess is you're looking at the (conditional) survival function for a particular country and position. — Eli, Apr 01 '22 at 13:13
@Eli Well, this might be suspicious point that keeps me busy. There is nothing else I do for modelling. I moved along http://www.sthda.com/english/wiki/cox-proportional-hazards-model#compute-the-cox-model under "Multivariate Cox regression analysis". — Ben, Apr 01 '22 at 13:30

David Thiessen · Accepted Answer · 2022-04-05T14:45:53.150

If you're including factor variables in your coxph model and not specifically accounting for that when you produce the estimated survival curves, the curves are probably not what you want. As @Jarle notes, the survfit function produces an estimate of the survival curve for an observation with average covariate values. This is in general not equal to the average survival curve of the whole population.

This is especially not equal to the average survival curve if there are factor variables included. Instead surfit will use the reference value of the factor variables.

Here's some example code illustrating the difference. The first plot is an estimate of the average survival curve, and it comes from the model with no covariate. The second model includes the group covariate, and the naive survival curve is significantly shifted to the left. The second plot is not an estimate of the overall survival curve (despite what the title of the plot says!), but is instead an estimate of the survival curve for the reference group. This is obvious when we plot the survival curves for both groups at the same time in plot 3, we see that plot 2 and the red line for plot 3 match exactly. Also, the estimate of the average survival curve from plot 1 is between the red and blue lines in plot 3, as expected.

library(survival)
library(survminer)
n <- 1000
group <- sample(1:2,n,replace=TRUE)
survtime <- rexp(n, rate = 0.7 - (0.3*group))
group <- factor(group)
censortime <- runif(n,0,10)
event <- censortime > survtime
time <- pmin(survtime,censortime)
dat <- data.frame(time,event,group)
mod0 <- coxph(Surv(time,event) ~ 1,
             data = dat)
mod1 <- coxph(Surv(time,event) ~ group,
              data = dat)
newdat <- data.frame(group = factor(c(1,2)))
conditional_mod1 <- survfit(mod1,newdata = newdat)
ggsurvplot(survfit(mod0),
           data = dat)

ggsurvplot(survfit(mod1),
           data = dat)

ggsurvplot(conditional_mod1,
           data=newdat)

^{Created on 2022-04-01 by the reprex package (v2.0.1)}

Note that there is no reason the curve should shift to the left specifically when adding the factor variable, it could shift to the right if the reference group has a longer survival time than average.

EDIT: The role of the newdata is explained in the documentation of survfit.coxph,

newdata
a data frame with the same variable names as those that appear in the coxph formula. It is also valid to use a vector, if the data frame would consist of a single row.

The curve(s) produced will be representative of a cohort whose covariates correspond to the values in newdata. Default is the mean of the covariates used in the coxph fit.

So, the relevant change is to use survfit(mod, newdata = newdat) instead of survift(mod). That makes the fitted curves use the values in newdat, instead of using the average covariate value and reference level of factor variables.

Because ggsurvplot requires a data argument I also used the newdat in that function, but that's secondary to the survfit function.

Very nice illustration. (+1) I suspect that the OP didn't specify a newdata and just used a default based on the original data. I don't know whether survminer does anything to fix the survfit.coxph() issue with that. — EdM, Apr 01 '22 at 16:34
Thank you. I'm also not sure whether survminer makes any corrections, but in this example it doesn't appear to. — David Thiessen, Apr 01 '22 at 16:42
@EdM I'm indeed using the same dataframe each time. So, to get it right, I have to use different dataframes which consist only of the variables I'm using for fitting? — Ben, Apr 05 '22 at 13:38
Hi @DavidLukeThiessen could you please tell me, how I do add/investigate several covariates? Like $ group $ and $ age $ or so. — Ben, Apr 06 '22 at 09:48
@Ben, you can just include all the covariates in the coxph call and the newdata dataframe. The newdata's names must match the covariate names and there is 1 row of newdata for each curve you want to make. So if you want to consider several curves at once you'll need several rows, one for each combination of covariate values you're interested in. I think everything works without any further modification. — David Thiessen, Apr 06 '22 at 15:42

score 3 · Answer 2 · answered Apr 01 '22 at 16:30

The admonition to "read the manual" particularly applies to survival-curve estimates from Cox proportional hazards (PH) models. If you are, say, using the R survminer package, make sure that you read and understand the default settings.*

You can think about the baseline survival curve** similarly to the intercept in a linear regression. An intercept is the outcome estimate when all covariates are at their reference levels. The value of an intercept depends on how covariates are coded: the choices of reference levels for categorical covariates and whether there's been centering of continuous covariates.

This gets further complicated with Cox models. First, instead of a single intercept you have a whole survival curve as a function of time. Second, the software can make unexpected choices for "reference levels" of covariates. I'm hesitant to say anything about the details of the baseline curves you got, as they depend on what software you used (perhaps unconsciously, depending on what other functions your software invoked).

The R survfit.coxph() package reports baseline survival at some "average" values even of categorical covariates! Even the manual page says: "The resulting curve(s) almost never make sense." Once you specify particular values of covariates you get appropriate covariate adjustment for corresponding conditional survival curves, but those baseline curves aren't useful on their own. Other software packages, like rms, can provide realistic baseline curves that are more representative of the data.

So don't worry about apparent differences in default baseline survival curves. Make sure that the Cox model is appropriate for your data and then examine predictions for realistic scenarios.

*I infer that's one of the tools you're using, based on other questions you've posted recently. That has some strengths but I've seen a few questions here that had more to do with not understanding the survminer defaults than with survival analysis per se. Such helper packages don't necessarily use the same default settings as the underlying survival package. You need to check.

**The baseline is inferred after fitting the regression; the survfit() default is the Breslow estimate

Why is the survival probability decreasing the more variables I use?

2 Answers2

Linked