
I am running a Cox regression on a large dataset (100k rows) with 135 variables after dummy encoding.

Most of the coefficients are reasonable and the confidence intervals are not large. However, for one covariate, the confidence interval is enormous (0.1-80); for reference, the next largest upper bound is 6. I've checked the distribution of this covariate, but nothing appears different from the others (i.e., no huge outliers, no strange distribution).
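
For reference, this is roughly the shape of the fit (a minimal sketch using lifelines; the file name and the "time"/"event" column names are placeholders):

```python
import pandas as pd
from lifelines import CoxPHFitter

# ~100k rows; "time", "event", and 135 dummy-encoded predictors (names are placeholders)
df = pd.read_csv("my_data.csv")

cph = CoxPHFitter()
cph.fit(df, duration_col="time", event_col="event")

# Hazard ratios with 95% CIs; the suspect covariate is the one with CI ~0.1-80
ci = cph.summary[["exp(coef)", "exp(coef) lower 95%", "exp(coef) upper 95%"]]
print(ci.sort_values("exp(coef) upper 95%", ascending=False).head())
```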

I decided that this information was pretty much useless and thought it might be best to simply remove this variable (especially because I think it could interfere with my results downstream too, namely that it violated the PH assumption and I'd have to add a time interaction, which I think would lead to strange results because I'd be multiplying time by huge negative coefficients).
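
Concretely, the PH check and time-interaction fix I have in mind would look roughly like this (a sketch with lifelines; `cph` and `df` as in the snippet above):

```python
from lifelines.statistics import proportional_hazard_test

# Schoenfeld-residual-based test of the PH assumption for each covariate
results = proportional_hazard_test(cph, df, time_transform="rank")
results.print_summary(decimals=3)

# Convenience wrapper: flags offending covariates and suggests remedies
# such as adding a time interaction or stratifying
cph.check_assumptions(df, p_value_threshold=0.05, show_plots=False)
```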

However, when I redo the analysis without this covariate, suddenly another (related) covariate is now very significant (though its coefficient is similar). My questions are:

  1. What could be causing such huge CIs in only this single covariate?
  2. Is my decision to eliminate this covariate based on such a finding justified?
  3. What does it mean for my interpretation that, once the troublesome covariate is removed, another covariate becomes significant?
  4. Is my idea in the parentheses in the third paragraph (starting "especially because I think...") correct?
JED HK
  • Sounds like multicollinearity to me -- what is the condition number of your predictor matrix, with and without this predictor? – jkpate Aug 08 '22 at 11:26 [see the condition-number sketch after these comments]
  • Expect such things when one of your indicator variables corresponds to a low-frequency group. Removing variables is not the answer. But better understand X before proceeding. It's uncommon for a 135-predictor model to be interpretable without first doing some data reduction (unsupervised learning). – Frank Harrell Aug 08 '22 at 11:27
  • @FrankHarrell my sole interest is the effect of the different levels of my treatment -- why would data reduction still be required in this scenario? Surely it is better to leave all the covariates in. – JED HK Aug 08 '22 at 11:33
  • @jkpate give me a little time to get this information to you – JED HK Aug 08 '22 at 11:33
  • You did not state the problem that way. If you have a single treatment variable of major interest, then yes, you may want to leave everything in. Still, there is the possibility of overadjustment, and you may want to put some restrictions on the adjustment covariates. Data reduction is often a good way to do that. – Frank Harrell Aug 08 '22 at 12:17
  • @jkpate indeed there is severe multicollinearity between the two variables (this also makes biological sense). I think removing one is a wise choice. – JED HK Aug 08 '22 at 13:44
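
Following jkpate's suggestion, the condition-number check might look like this (a sketch; `df` as in the question, and `suspect` is a placeholder for the troublesome dummy column):

```python
import numpy as np

# Predictor matrix: everything except the survival columns (names as above)
X = df.drop(columns=["time", "event"])
Xs = (X - X.mean()) / X.std()  # standardize so scale doesn't dominate; assumes no constant columns

# Rule of thumb: condition numbers well above ~30 suggest harmful collinearity
print("with suspect covariate:   ", np.linalg.cond(Xs.values))
print("without suspect covariate:", np.linalg.cond(Xs.drop(columns=["suspect"]).values))
```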

0 Answers