I have a Cox regression with 9 covariates in R, as below:

coxph(Surv(Tstart, Tstop, status) ~ cancer+A+B+C+D+E+F+(1|clusterid), data = dat)

I would like to check whether I need to include all of A, B, C, D, ... in the model. Can I choose based on the lowest AIC?

Also, is it necessary to check the Cox PH assumption when my main exposure, cancer, is time-varying in the model? That is, I restructured my data so that cancer is handled correctly as a time-dependent covariate.

1 Answer

For your first question,* automated variable selection is fraught with difficulties. If you do it, your results might not extend well to new data.

Frank Harrell provides extensive guidance in his course notes and Regression Modeling Strategies (RMS) book, particularly in Chapter 4 with respect to general strategies.

If you only have 9 predictors (including all levels beyond the first of categorical variables and any non-linear or interaction terms) there might be no need to reduce the number at all if you have on the order of 150 events, as you probably aren't at risk of severe overfitting. Even if some covariates don't have "statistically significant" coefficients, keeping them in will improve the predictive ability of the model.
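As a rough check of the balance between events and estimated coefficients described above, one can simply count both from a fitted model. This is only a sketch: it uses the `heart` data set shipped with the survival package as a stand-in for the asker's `dat`, so the variable names are that data set's and not those of the original model.

```r
library(survival)

# Stand-in model on the 'heart' data (counting-process format, as in the question)
fit <- coxph(Surv(start, stop, event) ~ age + year + surgery + transplant,
             data = heart)

n_events <- fit$nevent          # number of events used in the fit
n_params <- length(coef(fit))   # one per estimated coefficient
events_per_param <- n_events / n_params
events_per_param
```

For the scenario in the answer, 150 events and 9 coefficients give a ratio of about 17.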

If you have too many predictors for the number of events, see his recommendations for data reduction, or consider penalized regression (e.g., ridge) for covariates not of primary interest.
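A minimal sketch of the ridge idea, assuming the survival package's `ridge()` term and again using the bundled `heart` data as a stand-in: the exposure of primary interest (`transplant` here, playing the role of cancer) is left unpenalized, while the adjustment covariates are shrunk. The penalty `theta = 1` is an arbitrary illustrative value, not a recommendation.

```r
library(survival)

# 'transplant' is unpenalized; age, year and surgery share a ridge penalty
fit_ridge <- coxph(Surv(start, stop, event) ~ transplant +
                     ridge(age, year, surgery, theta = 1),
                   data = heart)

coef(fit_ridge)   # penalized coefficients appear alongside the exposure
```

In practice the amount of penalty would be chosen by some principled criterion rather than fixed in advance.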

An exception considered by Harrell is to do "limited backwards step-down variable selection if parsimony is more important than accuracy." (RMS, page 97).

For your second question, including a predictor as a time-varying covariate does NOT remove the need to evaluate its adherence to proportional hazards. A Cox model evaluates the covariates based on their values at event times. The association between a covariate and the hazard of an event certainly might change over time.
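In symbols (the notation here is mine, introduced only for illustration), a Cox model with a time-varying covariate $x(t)$ and a proportional-hazards coefficient can be written as

$$h(t \mid x(t)) = h_0(t)\,\exp\{\beta\,x(t)\}$$

where the covariate value changes over time but $\beta$ does not. A violation of PH in this sense means the coefficient itself depends on time:

$$h(t \mid x(t)) = h_0(t)\,\exp\{\beta(t)\,x(t)\}$$

For a binary covariate such as cancer status, the hazard ratio at time $t$ is then $\exp\{\beta(t)\}$ rather than a single constant $\exp\{\beta\}$.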

As I recall, you have a model that uses patient age as the time scale and reaching some level of disability as the event. Do you think that the extra log-hazard of an event due to having cancer is the same for a 30-year-old as it is for an 80-year-old? That's what a proportional hazards (PH) model would assume in that scenario. You need to evaluate that and at least illustrate any substantial non-proportionality with, for example, a plot of smoothed scaled Schoenfeld residuals over time.
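A sketch of that check with `cox.zph()` from the survival package, using the bundled `heart` data as a stand-in for the asker's data (here `transplant` is the time-varying exposure, in place of cancer):

```r
library(survival)

fit <- coxph(Surv(start, stop, event) ~ age + year + surgery + transplant,
             data = heart)

# Score tests of PH: one row per term plus a GLOBAL row
zph <- cox.zph(fit)
print(zph)

# Smoothed scaled Schoenfeld residuals for the time-varying covariate:
# an estimate of how its coefficient changes over time
plot(zph[4])                       # 4th term is 'transplant'
abline(h = coef(fit)[4], lty = 3)  # time-constant estimate for comparison
```

A roughly flat smooth around the dotted line is consistent with PH for that covariate; a clear trend is what would need to be described.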


*I recall that you have been warned about the dangers in treating both the "cancer" predictor and the disability "event" as binary. Different types and stages of cancer might have different associations with the disability you have in mind, and disability is typically graded rather than all-or-none. I'm providing an answer assuming that a survival model with time-varying covariates is appropriate, but consider whether this survival model is the best way to represent your data.

EdM
  • Thank you for your great reply. I need to digest it. As I see it, cancer as the exposure changes over time, so why should I check the assumption for it? Also, in the Cox regression (time-dependent Cox) I have adjusted for other variables that were measured at inclusion, which means they are not time-varying. – user358238 Sep 09 '22 at 18:05
  • So you mean that I should check the PH assumption for all the variables, including cancer. Do you suggest looking at both 1) the chi-square test, using cox.zph(res.cox) in R, and 2) scaled Schoenfeld residuals? What if some p-values in test 1 are not high? Should I stratify on those variables in the Cox model, or do something else? – user358238 Sep 09 '22 at 18:05
  • I read some documents and even the book by Kleinbaum. I concluded that there is no need to check the PH assumption afterwards. – user358238 Sep 09 '22 at 18:07
  • Please look at the text from the Kleinbaum book in the update. – user358238 Sep 09 '22 at 18:23
  • Also, another page of the book says: If we now calculate the estimated hazard ratio that compares exposed to unexposed persons at time t, we obtain the formula shown here; that is, HR “hat” equals the exponential of b “hat” plus d “hat” times t. This formula says that the hazard ratio is a function of time; in particular, if d “hat” is positive, then the hazard ratio increases with increasing time. Thus, the hazard ratio in this example is certainly not constant, so that the PH assumption is not satisfied for this model. – user358238 Sep 09 '22 at 18:29
  • @user358238 the derivation of the Cox PH model in Chapter 3 of Therneau and Grambsch explicitly allows for covariates that vary over time. PH is still possible in that situation, provided that the association between the covariate and log-hazard is constant over time. The covariate values themselves don't need to be constant in time. You should check all predictors for PH, but it's not necessarily fatal if PH doesn't hold. You should at least describe the violation over time, as I suggested. – EdM Sep 09 '22 at 19:32
  • @EdM Thanks. But I checked the PH assumption, and for some of the covariates (I have 10 covariates) the p-value is small, so the PH assumption is violated. What should I do at this stage? Let's say PH is not fulfilled for, e.g., smoking, depression and BMI: how do I describe that, and is it OK to still go with that Cox model? – user358238 Sep 09 '22 at 19:38
  • @user358238 Kleinbaum and Klein (KK) use "PH" in a more restricted way than Therneau and Grambsch (TG). KK consider proportionality of hazards among individuals: They say (page 245) that PH "means that the hazard for one Individual is proportional to the hazard for any other individual, where the proportionality constant is independent of time" (emphasis added). In that case, time-varying covariates would make individual PH impossible. TG consider PH with respect to associations of covariate values with outcome over time, which can be constant and are what cox.zph() evaluates. – EdM Sep 09 '22 at 19:39
  • @EdM Does that mean that if the assumption is not fulfilled for some variables based on cox.zph(), I need to stratify on them in the model? Or what do you suggest? – user358238 Sep 09 '22 at 19:58
  • @user358238 you have a large data set as I recall, so "statistically significant" violations of PH are almost certain. You must apply your understanding of the subject matter along with plots of estimated coefficient values over time (from Schoenfeld plots) to decide whether those are "practically significant." See this answer among many others on this site. Also, see Section 20.7 of Harrell's RMS text. – EdM Sep 09 '22 at 20:00
  • @EdM Thanks. I will read them. So it is not like what I usually do: if PH is not fulfilled for some variables, I stratify on them in the Cox regression. And, as you know, they are then excluded from the Cox summary results. My interest is only in interpreting the HR for cancer status, not the other covariates, but I still need to adjust for them. – user358238 Sep 09 '22 at 20:20
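For reference, the stratification approach raised in the comments can be sketched as follows, again using the survival package's `heart` data as a stand-in (`surgery` plays the role of an adjustment covariate that fails PH, `transplant` the time-varying exposure of interest). Whether stratifying is appropriate in a given analysis remains a subject-matter judgment, as discussed above.

```r
library(survival)

# Each level of 'surgery' gets its own baseline hazard; no coefficient is
# estimated for it, but the model is still adjusted for it
fit_strat <- coxph(Surv(start, stop, event) ~ age + year + transplant +
                     strata(surgery),
                   data = heart)

names(coef(fit_strat))   # no 'surgery' coefficient in the summary
cox.zph(fit_strat)       # PH can be re-checked for the remaining terms
```

This matches the observation in the last comment: the stratified variable drops out of the coefficient table while the hazard ratio for the exposure is still reported.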