
I am estimating a logistic regression on a subset of my data and predicting the outcome for the whole sample.

  1. In the regression, only some regressors are statistically significant at the chosen level. Should I include only those regressors in the prediction, or all of them?
  2. I cannot choose my preferred model specification, as it includes year fixed effects and the years do not perfectly overlap between the two samples. Should I just go for my second-best specification without the fixed effects?
  3. How do I go about the first point in R?

For reference, my code in R looks like this:

db_tr <- data[data$group=="1",]
db_pr <- data[data$group=="2",]

f <- formula(y ~ x1 + x2 + factor(year))
logit.training <- glm(f, family = binomial(link = "logit"), data = db_tr)
logit.predict <- predict(logit.training, newdata = db_pr, type = "response")

MCS
  • I'm having trouble understanding your description in point 2. Is it possible to clarify this? – user20160 Jan 05 '21 at 19:39
  • My toy database has "year" among its variables, which I use for time fixed effects. However, if the training dataset (i.e. db_tr) covers, say, years 1 to 5 and the prediction dataset (i.e. db_pr) covers years 6 to 10 with no overlap, then despite getting more precise estimates when I include time fixed effects, I cannot include them in the formula because I get the following error in R: Error in model.frame.default(Terms, newdata, na.action = na.action, xlev = object$xlevels): factor factor(year) has new levels 6, 7, 8, 9, 10 – MCS Jan 06 '21 at 16:35
  • "Which variables to include?" I would refer you to Hosmer & Lemenshow's book "Applied Logistic Regression", chapter 4: Model-Building Strategies and Methods for Logistic Regression, and in particular Section 4.2 Purposeful Selection of Covariates. – ColorStatistics Jan 06 '21 at 18:17

2 Answers


Not sure what will constitute "reputable" in your eyes, but here is an approach.

  1. There is no need to remove predictors that fail to achieve statistical significance. I suspect the thinking here is that statistical significance is a sign of importance, but that isn't quite right. Statistically significant effects can be small and have little impact on the predicted probabilities, while large effects can come with large uncertainty. Removing the latter may hurt the model (the first sketch after this list illustrates this).

  2. I'm not sure quite what you mean here. I see you're modelling year as a factor, which I would advise against, not least because the model will then not be usable in future years. For example, if your data had years 2015 through 2020 and you passed the model data from 2021, then 2021 is not recognized as a factor level and thus has no regression coefficient. Modelling year as a continuous variable avoids this and allows the model to predict for years not in the training data. You can easily verify this in R (the second sketch below).
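
To illustrate point 1, here is a minimal sketch on simulated data (all variable names here are hypothetical, not from the question): a predictor can fail the significance test in the fitted model yet still shift the predicted probabilities when dropped.

# Simulated example: x2 has a modest effect and may not reach significance
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)
y  <- rbinom(n, 1, plogis(1.5 * x1 + 0.4 * x2))

full    <- glm(y ~ x1 + x2, family = binomial(link = "logit"))
reduced <- glm(y ~ x1,      family = binomial(link = "logit"))

summary(full)$coefficients                       # x2 may well miss the 5% level
head(cbind(full = fitted(full), reduced = fitted(reduced)))  # predictions still differ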

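And a sketch of the factor-level problem from point 2, again on simulated data with hypothetical names: prediction for an unseen year fails with factor(year) but works when year enters as a continuous variable.

# Train on years 2015-2020, then try to predict for 2021
set.seed(2)
train <- data.frame(year = rep(2015:2020, each = 30), x1 = rnorm(180))
train$y <- rbinom(180, 1, plogis(0.3 * train$x1 + 0.1 * (train$year - 2015)))
new <- data.frame(year = 2021, x1 = 0)

m_factor <- glm(y ~ x1 + factor(year), family = binomial, data = train)
# predict(m_factor, newdata = new, type = "response")
# -> Error: factor factor(year) has new levels 2021

m_cont <- glm(y ~ x1 + year, family = binomial, data = train)
predict(m_cont, newdata = new, type = "response")  # works for the unseen year
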
  • My struggle lies specifically with whether to include large effects with large uncertainty (not statistically significant) that could skew my predicted probabilities. – MCS Jan 06 '21 at 18:38
  • Your suggestion certainly solves the error I receive, but I was hoping to include year-specific fixed effects in my model using dummy variables (LSDV model). Would modelling years as a continuous variable not fail to do that? – MCS Jan 06 '21 at 18:42