3

I'm working with cox proportional hazard (cox-PH) in R. I'm using the smoothed scaled Schoenfeld residual plot to assess the PH assumption.

Suppose I have this model in R, disease as outcome and genetics as exposure:

#Residual plot for genetics:
cox_model <- coxph(Surv(start, end, disease)~1+factor(genetics)+strata(gender), robust =T, data=mydata)

#Residual plot: test.ph <- cox.zph(cox_model) plot(test.ph, resid = FALSE)

The covariate genetics has levels in it. Meaning I have made a loop for the cox regression so it can run (make the analysis) for every single level in the covariate genetics. The levels are named like this geneA and geneB. So the covariate genetics has two levels.

When I run a PH test in R, and I look at the residual plot, then it says that the covariate on the plot is genetics. It does not show me if it is geneA or geneB. Then I tried to specify the model, so I can visualise the residual plot seperatly for geneA and geneB, like this:

#Residual plot for geneA:
cox_model <- coxph(Surv(start, end, disease)~1+factor(genetics==geneA)+strata(gender), robust =T, data=mydata)

#Residual plot test.ph <- cox.zph(cox_model) plot(test.ph, resid = FALSE)

#Residual plot for geneB: cox_model <- coxph(Surv(start, end, disease)~1+factor(genetics==geneB)+strata(gender), robust =T, data=mydata)

#Residual plot test.ph <- cox.zph(cox_model) plot(test.ph, resid = FALSE)

I get two different plots (which is not a surprise). But what I'm confused about is, that the first plot I get with genetics as the covariate, looks different from the plots of geneA and geneB. So the conclusion is I get three different residual plots, one for geneA, second for geneB and a third for the covariate genetics with both levels?

My question is, how should I interpret the smooth scaled Schoenfeld residual plot that shows the covariate genetics? Is it a combination of both geneA and geneB? And what does that combination mean?

Many thanks in advance !!! :)

Update: I'm using a 3 level co-factor, meaning genetics has 3 levels saying geneA, geneB and none. EdM has answered my Q below

Aphi11
  • 163
  • 7
  • Please edit the question to add a sample model summary and the corresponding residual plots. I’m not sure that I understand the situation just from the written description. It might have to do with the choice of reference level for the categorical predictor. – EdM Jan 06 '24 at 10:12
  • If a categorical predictor has more than 2 levels, the cox.zph() function (in recent versions) can provide either an overall plot for the predictor or plots for individual levels versus the reference. When you edit the question, please specify the version of the software that you are using. – EdM Jan 06 '24 at 15:16
  • Thanks for answering @EdM. I'm working in a closed system that makes me unable to copy paste codes. I would have to manually write more than 30 lines of coding. I hope my following comment can clarify my Q. – Aphi11 Jan 07 '24 at 12:01
  • It's the newest version of cox.zph() I'm using. It is exactly what it's providing me - an overall plot for the predictor, genetics (which has two levels, geneA and geneB). So it's showing me one plot with one curve for both geneA and geneB. My Q is, how should I interpret the overall plot? Because when I look at the individual levels plot (geneA and geneB separately), they look different from each other. So what is the overall plot showing? The average HR over time for both geneA and geneB? Or how should I think about it? – Aphi11 Jan 07 '24 at 12:03
  • I have updated the question with adding a short example of what I have written in R. – Aphi11 Jan 07 '24 at 12:10

1 Answers1

2

Your attempt to isolate individual levels of genetics isn't doing what you think. The question that you raise about interpretation of Schoenfeld plots is the same as you might have raised about interpretation of model coefficients.

For a 2-level factor you are evaluating the difference in outcome between the 2 levels. You are always thus including both levels in the calculations and plots. The only issue is the direction of the difference, set by the choice of reference level. All 3 models you show in your question are the same except for their choice of reference level of the genetics factor.

If you don't specify otherwise, the reference level for a factor is the lowest in alphanumeric order of the level names, presumably geneA in your data. The regression coefficient is for the log-hazard difference between the second level and the reference level, or for geneB-geneA in this case.

When you specify factor(genetics==geneA) instead, you are simply reversing the choice of reference level. Both geneA and geneB cases are still included, but the reference level is now for a value of 0 (FALSE) for genetics==geneA; that is, the reference level is now geneB and the regression coefficient is for the geneA-geneB log-hazard difference. Specifying factor(genetics==geneB) just re-reverses the reference-level choice and gives the same model as you get with factor(genetics).

Here's are reproducible examples based on the lung dataset built into the R survival package, similar to your models. Sex is coded Male=1, Female=2, providing a 2-level categorical predictor like your genetics. I included 2 strata based on the inst numbers distinguishing treating institutions. There's no need to include the 1 in the formula, but I kept it to match your code.

library(survival)
mod1 <- coxph(Surv(time, status) ~ 1 + factor(sex) + strata(inst<10), data=lung)
mod2 <- coxph(Surv(time, status) ~ 1 + factor(sex==1)+ strata(inst<10), data=lung)
mod3 <- coxph(Surv(time, status) ~ 1 + factor(sex==2)+ strata(inst<10), data=lung)

The coefficients for mod1 and mod3 are the same; their scaled Schoenfeld residual plots are the same except for the y-axis label. The coefficient for mod2 is just the negative of that for the other models; its scaled Schoenfeld residual plot is a mirror image of the others. Both levels of genetics are involved in all models and plots.

You should know that things are a bit more complicated with a factor having 3 or more levels. Then the terms argument to cox.zph() determines whether you evaluate all levels together or evaluate each other level relative to the reference level.

In response to comments

Comments on this answer indicate that the genetics predictor actually was a 3-level categorical predictor, with levels none (the reference), geneA (for one type of variant), and geneB (for a different variant).

In that case the first residual plot produced by the code in the question, with the default terms=TRUE argument to cox.zph(), combines the scaled Schoenfeld residuals over all levels of genetics. That's analogous to how an anova() function combines a test over all levels of a categorical predictor, and is probably the most useful analysis and plot for evaluating PH for genetics.

With the 3-level genetics predictor, the second and third models in the question still didn't do what might have been intended. The factor(genetics==geneA) model compared geneA cases against the combination of none and geneB cases, and similarly for the factor(genetics==geneB) model. I suspect that neither the models nor their PH assessments would be very useful in practice.

If for some reason you want separate analyses and plots for geneA versus none and for geneB versus none, specify terms=FALSE in cox.zph().

EdM
  • 92,183
  • 10
  • 92
  • 267
  • As always, thanks a lot EdM! You are so good at explaining stuff about cox regression. I finally understand it now. I was confused before because I thought that the reference level was those who did not carry the cofactor genetics (meaning neither geneA or geneB). That's why I was confused about what the plot was showing. what if I wanted to see if the PH assumption is hold true between those carring the genetics and those who does not (the outcome would still be disease). Would that make sense to compare? – Aphi11 Jan 07 '24 at 18:46
  • @SAphi11 perhaps you are already using a 3-level genetics predictor? That would make sense if the only (and mutually exclusive) options were none, geneA and geneB. Look at summary(factor(my.data$genetics)). If you aren't, I would recommend that. In that case you could set the reference level (via relevel()) to be those without either of geneA or geneB. You would choose, with the terms argument to cox.zph(), whether to get an overall estimate of PH combining all levels (my favorite) or to evaluate each of geneA and geneB separately with respect to the reference level. – EdM Jan 07 '24 at 19:01
  • I missed that detail. I am using a 3-level genetics as you say. It's a none, geneA and geneB. So wouldn't that change the interpretation of the first plot I got from R, by visualising the PH test by only specifying the cofactor genetics (the plot that provided me results with genetics on the y-axis). Because then I don't understand which of the levels it is using as a reference and what it is compared to. When you say "get an overall estimate of PH combining all levels". Does that mean that level 0, A and B are all combined? Or the combination of geneA or geneB and comparing it to not carrying – Aphi11 Jan 07 '24 at 19:32
  • I just found out that, the reference level by default is none (as it is assigned a zero as factor). So I'm suspecting that the first residual plot I got, used none as ref, and combined the HR of geneA and B. Does that make sense? and was it that one you pointed out as you favorite? – Aphi11 Jan 07 '24 at 20:09
  • @SAphi11 I've added a bit to the end of the answer to cover the 3-level situation. – EdM Jan 07 '24 at 20:48
  • thank you so much! but when you write "the first residual plot....combines the scaled Schoenfeld residuals over all levels of genetics". What is then used as a ref? none=0 ? and then the curve/line in the plot (the HR over time), is that the average HR of gene a and gene b? – Aphi11 Jan 09 '24 at 08:58
  • @SAphi11 thinking about your case more, I'd recommend using terms=FALSE in cox.zph(), as that would directly show how much either of the geneA or geneB coefficients (with none as the reference) might violate PH. With the default terms=TRUE, cox.zph() returns coefficient-weighted residual sums over all non-reference levels. The curve is then for the "linear predictor" including all non-reference levels, for which the combined coefficient is 1. The curve with terms=TRUE shows the shape in time of overall PH violation, but not which level(s) might be responsible. – EdM Jan 09 '24 at 14:51
  • Thanks a lot! It all makes more sense now!! I have by now done both thing. And everything's works. I have made residual plots for all combined, aswell as residual plots showing geneA vs. none, and geneB vs. none. In your opinion, what would be best to include in a paper (in the supplementary)? I'm interested in including less plots. So wouldn't it still be okay to visually inspect the residual plots by looking at all the levels combined? Without necessary looking at how each level is contributing to the predictor line, as long as the PH assumption is hold? – Aphi11 Jan 12 '24 at 13:30
  • I see that it might be a problem if including a plot that violates PH without showing which of the levels that contribute to that. I would really like to hear your opinion on this – Aphi11 Jan 12 '24 at 13:33
  • @SAphi11 if PH isn't violated, why show plots at all? If it is violated for a categorical predictor, what to show depends on the nature of the predictor. If there's no clearly defined reference level, the combined plot with terms=TRUE is OK. You, however, have a substantively meaningful reference (none) and are specifically interested in how geneA and geneB are associated with outcomes different from those with none. With a substantively meaningful reference, you probably want to use terms=FALSE. Don't worry about how many figures to include in a supplement. – EdM Jan 12 '24 at 15:56
  • it is because the PH is violated when only testing by cox.zph. But the visual inspection with residual plots shows almost a linear line. So I wanted to show that the PH is not violated by visual inspection. But I see what you mean. It makes more sense to have the none as a reference for each level. But then in which case would you use the combined residual plot? – Aphi11 Jan 12 '24 at 16:53
  • @SAphi11 the combined residual plot makes sense when the choice of reference level is arbitrary and all levels are more or less equivalent in terms of the hypothesis that you are testing. My sense is that you are more interested in the differences of each of geneA and geneB from none, so the separate plots seem to make more sense in your situation. – EdM Jan 12 '24 at 20:42
  • as always, a big thank you!! I definitely see what you mean. – Aphi11 Jan 13 '24 at 16:10