In Hosmer and Lemeshow's book, the section on the model-building phase states:

Once we have obtained a model that we feel contains the essential variables, we should look at the variables in the model

The question of appropriate categories for discrete variables should have been addressed at the univariable stage.

But the authors don't explain how to do it!

Metariat
  • State exactly what you are trying to accomplish. Typically we think of a reference cell coding, with the reference cell and order of categories being quite arbitrary. Specific predictions and contrasts handle this arbitrariness painlessly. NOTE: there should not be a univariable stage as this will badly distort the final statistical inferences. – Frank Harrell Sep 21 '15 at 12:02
  • @Frank Harrell: Sorry the question was unclear. I am at the model-building stage for logistic regression. First step: select the variables (forward, backward, stepwise, lasso, ...). Once that is done, the second step is, for continuous variables, to check the linearity assumption, and for discrete variables, the book says we should question whether the categories are appropriate. I don't know what is meant by that. I imagine it is about regrouping the categories in a meaningful way, for example merging categories that have similar logits? – Metariat Sep 21 '15 at 12:11
  • This whole strategy is a disaster in my humble opinion. Spend your effort specifying the model, then fit that model, do a few diagnostics, and be done. You can tell you might have done things correctly by the presence of insignificant variables in the model. For variables not known to act linearly allow them to be nonlinear using e.g. restricted cubic splines. Details are in my handouts linked from http://biostat.mc.vanderbilt.edu/rms – Frank Harrell Sep 21 '15 at 13:03
  • In this question: http://stats.stackexchange.com/q/90263/77852, @Frank Harrell said "univariate analysis can cause an amazing amount of damage when done before multivariable analysis, because there is a temptation to use the univariate results in guiding model building". But in general you merge categories so that each has enough cases (some tests require at least 5) and consider transformations (logs, x^2, etc.). – Robert Sep 21 '15 at 14:41

1 Answer

I don't know how to turn Frank Harrell's comment into an answer, so I am copying it here:

This whole strategy is a disaster in my humble opinion. Spend your effort specifying the model, then fit that model, do a few diagnostics, and be done. You can tell you might have done things correctly by the presence of insignificant variables in the model. For variables not known to act linearly allow them to be nonlinear using e.g. restricted cubic splines. Details are in my handouts linked from biostat.mc.vanderbilt.edu/rms
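
To make the restricted cubic spline suggestion concrete, here is a minimal sketch of the truncated-power basis for such a spline, written in plain Python. The function name `rcs_basis` and the knot locations are assumptions chosen for illustration, not something taken from the comment; in practice one would more likely use an existing implementation such as `rcs()` in the R `rms` package.

```python
def rcs_basis(x, knots):
    """Return [x, s_1(x), ..., s_{k-2}(x)] for a restricted cubic spline.

    The spline is cubic between knots and constrained to be linear
    below the first knot and above the last knot.
    """
    t = sorted(knots)
    k = len(t)
    if k < 3 or len(set(t)) != k:
        raise ValueError("need at least 3 distinct knots")

    def pos_cube(u):
        # Truncated cubic: u^3 when u > 0, otherwise 0.
        return u ** 3 if u > 0 else 0.0

    scale = (t[-1] - t[0]) ** 2   # keeps the nonlinear terms on the scale of x
    span = t[-1] - t[-2]          # distance between the last two knots
    row = [float(x)]              # the linear term
    for j in range(k - 2):
        term = (
            pos_cube(x - t[j])
            - pos_cube(x - t[-2]) * (t[-1] - t[j]) / span
            + pos_cube(x - t[-1]) * (t[-2] - t[j]) / span
        )
        row.append(term / scale)
    return row


# Hypothetical knot locations, e.g. for an age variable.
knots = [20.0, 35.0, 50.0, 65.0]
for age in (10, 30, 45, 70, 80, 90):
    print(age, [round(v, 4) for v in rcs_basis(age, knots)])
```

With k knots this gives k - 1 columns per continuous predictor. Entering them as covariates in the logistic regression lets the logit be a smooth nonlinear function of the variable, so no separate linearity check or recoding step is needed afterwards.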

Metariat