Cox model with factors

Question

I have a simple Cox proportional hazards model, with a small made-up data set. There are three groups, and the data are generated so that the time-to-event is larger in the first group than in the other two (the code adds 6 to the times in group 1). Please note that this code is copied from another poster's example (see the following link: KM versus Cox model):

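library(survival)  # provides coxph() and Surv()

times <- rexp(30)
status <- rep(1, 30)  # every observation is an event, no censoring
groups <- rep(c(1, 2, 3), 10)
dat <- data.frame(groups, times, status)
# shift group 1 so that its event times are later than the others
dat[dat$groups == 1, "times"] <- dat[dat$groups == 1, "times"] + 6
mod1 <- coxph(Surv(times, status) ~ groups, dat)
summary(mod1)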

That is fine, but the problem appears when I explicitly make the 'groups' variable a categorical variable:

dat$groups <- factor(dat$groups)
mod1<- coxph(Surv(times, status) ~ groups, dat)
summary(mod1)

I get sensible results when the group variable is numeric, but not when it is converted to a factor. It seems like the model, if anything, should be MORE sensible when the groups are categorical.

What is happening? Why do I get the following warning when I run the model with the categorical 'groups' variable:

In coxph.fit(X, Y, istrat, offset, init, control, weights = weights, :
  Loglik converged before variable  1,2 ; coefficient may be infinite.

EdM · Accepted Answer · 2023-01-02T21:32:35.533

This is similar to perfect separation in logistic regression. When I ran your code after set.seed(102), the largest value of "times" was 3.35. Adding 6 to the times in group1 made their survival times longer than those of any individual in the other 2 groups.

Treating group as a linear continuous predictor assumes that each extra level of the numeric group has the same extra association with the log-hazard of an event. With 3 groups that poses no problem in solving the score equation to get a regression coefficient. The software starts with an estimated regression coefficient value of 0 (hazard ratio of 1) for the numeric group predictor, then iteratively adjusts that value to best correspond to a single linear association between log-hazard and the 3 numeric values of group. In this case, the result ends up a compromise between the (infinite) log-hazard difference between group1 versus the others and the (lack of) log-hazard difference between the second two groups.
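As a minimal sketch of the numeric case (assuming the survival package and the set.seed(102) mentioned above; the object names mod_num, beta and implied are only illustrative), the single fitted coefficient forces the same log-hazard step from group 1 to 2 as from group 2 to 3:

```r
library(survival)

set.seed(102)  # seed taken from the answer text; any seed could be used
dat <- data.frame(groups = rep(c(1, 2, 3), 10),
                  times  = rexp(30),
                  status = rep(1, 30))
dat$times[dat$groups == 1] <- dat$times[dat$groups == 1] + 6

# Numeric predictor: one coefficient describes the whole 1 -> 2 -> 3 trend
mod_num <- coxph(Surv(times, status) ~ groups, data = dat)
beta <- unname(coef(mod_num))

# Log-hazards relative to group 1 implied by the linear assumption
implied <- c(group1 = 0, group2 = beta, group3 = 2 * beta)
print(implied)
```

The model cannot represent "group 1 is very different, groups 2 and 3 are alike" with one slope, which is why the estimate stays finite.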

When you converted to a factor and (implicitly) used group1 as the reference level, you created a problem. The regression coefficients for the other 2 groups then represent the log-hazard associated with membership in those two groups, relative to that in group1. Similarly to the above, the software will start with dummy coefficients of 0 for the other 2 groups, and then iteratively try to find values for each of them that best represent the observed log-hazard relative to that of group1.

But all of the events in the other 2 groups happened before any of the events in group1. Those 2 groups have infinitely higher hazards than the reference group, group1. So as the software tries to adjust the initial coefficient estimates to match the data, the coefficients run off toward infinity.
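One way to check for this kind of separation before fitting is to compare the event times directly. This is a rough sketch using base R only, again assuming set.seed(102); the variable names are hypothetical:

```r
set.seed(102)  # seed taken from the answer text
dat <- data.frame(groups = rep(c(1, 2, 3), 10),
                  times  = rexp(30),
                  status = rep(1, 30))
dat$times[dat$groups == 1] <- dat$times[dat$groups == 1] + 6

# Latest event among groups 2 and 3 versus earliest event in group 1
latest_other    <- max(dat$times[dat$groups != 1])
earliest_group1 <- min(dat$times[dat$groups == 1])

# TRUE means every event in groups 2 and 3 precedes every event in group 1
separated <- latest_other < earliest_group1
print(c(latest_other = latest_other, earliest_group1 = earliest_group1))
print(separated)
```

If separated is TRUE, the factor coefficients relative to group1 have no finite maximum-likelihood estimate.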

You can get results more like you were expecting if you add a value less than 6 to group1 (adding 3 worked for me) or (maybe) if you set a different group as the reference level of the factor. Alternatively, if you only had 2 groups with non-overlapping survival times and treated them as numeric rather than categorical, you would end up with the same lack of convergence as there is no longer a third group that requires compromise to provide a single regression coefficient.
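The two suggestions can be tried as follows. This is only a sketch: make_dat() is a hypothetical helper written for this illustration, and whether a given fit warns depends on the simulated data.

```r
library(survival)

# Hypothetical helper: rebuild the toy data with a chosen shift for group 1
make_dat <- function(shift, seed = 102) {
  set.seed(seed)
  dat <- data.frame(groups = rep(c(1, 2, 3), 10),
                    times  = rexp(30),
                    status = rep(1, 30))
  dat$times[dat$groups == 1] <- dat$times[dat$groups == 1] + shift
  dat$groups <- factor(dat$groups)
  dat
}

# Option 1: a smaller shift, so group 1 may overlap with the other groups
dat3 <- make_dat(shift = 3)
mod_shift3 <- coxph(Surv(times, status) ~ groups, data = dat3)
print(coef(mod_shift3))

# Option 2: keep the shift of 6 but use group 2 as the reference level
dat6 <- make_dat(shift = 6)
dat6$groups <- relevel(dat6$groups, ref = "2")
mod_ref2 <- coxph(Surv(times, status) ~ groups, data = dat6)
print(coef(mod_ref2))
```

With group 2 as the reference, the group 3 coefficient compares two overlapping groups, but the group 1 coefficient still compares completely separated groups, so it would be expected to drift toward minus infinity and the warning may persist for that term.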

Thank you! This was helpful. I understand the difference between treating the variable as a continuous (numeric) predictor and treating it as a factor with a reference level, but in either case, if there is something like 'perfect separation' when moving from group one to groups 2 or 3, why is it only a problem when the variable is treated as a factor?
When the predictor is numeric, then the estimated coefficient describes the (log) HR for each unit increase, and when it is a factor, then the coefficients describe the HR when moving from the reference group to groups 2 and 3. — Aaron, Dec 28 '22 at 16:36
But if the survival time for group 1 is longer than the survival times for any member of group 2 and group 3, then that is true regardless of whether the group variable is numeric or categorical.
Can you please provide any additional clarification on this point? It is possible that if I understood the score equation better this confusion would go away, but I am still a little stuck on this issue. — Aaron, Dec 28 '22 at 16:36
By the way, I tried using the same data set generation but with 300 rows instead of 30, and the problem went away, confirming what you said: there was no longer an issue with 'perfect separation' and the estimates were reasonable/non-infinity. Also, redoing the original data set but adding 2 instead of 6 resulted in sensible estimates. — Aaron, Dec 28 '22 at 16:36
@Aaron what saves things with the numeric values in your example is having more than 2 numeric values for group. If you only had 2 groups with non-overlapping survival times, the lack of convergence problem would also appear. — EdM, Jan 02 '23 at 21:35
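The two-group case described in this comment can be sketched like this (assuming the survival package; dat2, mod_two and fit_warnings are illustrative names, and the exact warning text may vary by package version):

```r
library(survival)

set.seed(102)
dat2 <- data.frame(groups = rep(c(1, 2), 10),
                   times  = rexp(20),
                   status = rep(1, 20))
# Shift group 1 so its event times should not overlap with group 2
dat2$times[dat2$groups == 1] <- dat2$times[dat2$groups == 1] + 6

# Collect any warnings raised during fitting so they can be inspected
fit_warnings <- character(0)
mod_two <- withCallingHandlers(
  coxph(Surv(times, status) ~ groups, data = dat2),
  warning = function(w) {
    fit_warnings <<- c(fit_warnings, conditionMessage(w))
    invokeRestart("muffleWarning")
  }
)

print(coef(mod_two))
print(fit_warnings)
```

Here the numeric predictor takes only two values, so there is no third group to force a compromise and the single coefficient behaves like the factor coefficients in the original example.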
