I need some advice here. My dependent variable is binary (snows or not) and my independent variables are temperature, pressure, and a wind index. I have daily data for 30 years. The data are correlated (since they are daily), so I am afraid to use logistic regression. Any idea of a model that I could use?
-
Welcome to Cross Validated! What's wrong with correlated predictors? – Dave Aug 16 '22 at 13:35
-
The observations are correlated because the predictors are time series (daily temperature, daily pressure, daily wind index). One of the assumptions of logistic regression is that the observations should be independent. – lola Aug 16 '22 at 13:39
-
There is no assumption about uncorrelated predictors, and it is routine to have correlated predictors. – Dave Aug 16 '22 at 13:43
-
@Dave the outcomes, however, might be correlated in time. A long snowstorm might continue for 2 or more consecutive days. Long stretches of snowstorm-free days are found over several months of the year. – EdM Aug 16 '22 at 14:57
-
Do you have any idea about which statistical test to use? – lola Aug 17 '22 at 05:52
1 Answer
As Dave pointed out in a comment, having predictors correlated among themselves isn't a problem per se in regression. It can lead to wide confidence intervals for individual predictors, but the model as a whole can be useful if the predictors together are associated with outcome.
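For illustration, here's a minimal R sketch of that first step. It assumes a data frame `dat` with placeholder columns `snow` (0/1), `temp`, `pressure`, and `wind` standing in for your own variables; the variance inflation factors show how much correlation among the predictors widens the uncertainty of individual coefficients:

```r
library(car)  # for vif()

## Assumed data frame `dat`: one row per day, with placeholder columns
## snow (0/1), temp, pressure, and wind standing in for your variables.
fit <- glm(snow ~ temp + pressure + wind, family = binomial, data = dat)
summary(fit)

## Variance inflation factors: large values indicate that correlation among
## the predictors is inflating individual coefficient standard errors,
## even though the model as a whole can still predict well.
vif(fit)
```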
The main problem with time series is that observations over time aren't independent, which shows up as autocorrelation of the error terms in the model. Even in that case, for multiple regression models* (at least for continuously distributed outcomes that meet the assumptions of linear regression):
our forecasts may be inefficient — there is some information left over which should be accounted for in the model in order to obtain better forecasts. The forecasts from a model with autocorrelated errors are still unbiased, and so are not “wrong”, but they will usually have larger prediction intervals than they need to.
See Section 5.3 of Hyndman and Athanasopoulos, Forecasting: Principles and Practice (2nd ed). One approach would be to fit your model and see whether there is autocorrelation along time in your error terms.
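Continuing the sketch above (and assuming the rows of `dat` are in chronological order, one per day), you could inspect the autocorrelation function of the deviance residuals from that fit:

```r
## Deviance residuals from the logistic fit above
res <- residuals(fit, type = "deviance")

## Autocorrelation of residuals over time; spikes well outside the
## confidence bands at low lags suggest temporal autocorrelation.
acf(res, main = "ACF of deviance residuals")
```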
With your binary outcomes, a question is what you should use as the residual values that estimate the error terms.** This is a bit outside my expertise, but the following might be helpful.
A simple choice for residuals, the difference between the estimated probability of the event (between 0 and 1) and the observed value (exactly 0 or 1), can be misleading: the expected variance is a function of the event probability rather than constant, as assumed in standard linear regression. The DHARMa package in R provides simulated standardized residuals that can be more useful for generalized linear models like logistic regression, along with a Durbin-Watson test for their temporal autocorrelation.
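A sketch of that DHARMa workflow, again using the hypothetical fit from above and treating the row order of `dat` (assumed complete and chronological) as the daily time index:

```r
library(DHARMa)

## Simulated, standardized residuals for the fitted logistic model
sim <- simulateResiduals(fittedModel = fit)
plot(sim)  # general residual diagnostics

## Durbin-Watson test for temporal autocorrelation of those residuals;
## rows of `dat` are assumed to be complete and in chronological order.
testTemporalAutocorrelation(sim, time = seq_len(nrow(dat)))
```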
If you don't have substantial temporal autocorrelation of residuals, then the time-series aspect of the data isn't a problem. If you do have autocorrelation of errors, you could try some of the approaches suggested by Hyndman and Athanasopoulos; although they are presented in the context of continuous outcomes, the regression principles are similar.
With autocorrelation you could consider a Markov-type model that includes as an additional predictor the outcome state of the previous day (or multiple days and associated time intervals). Frank Harrell suggests that as an approach even for continuous-outcome time series evaluated via ordinal logistic regression; see Section 7.8.4 of his course notes. For example, if there is mainly a lag-1 autocorrelation, you could add to your binary regression model a term for a function of the prior outcome value (and perhaps time).
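As a rough sketch of that lag-1 idea (still with the placeholder variable names, and assuming one chronologically ordered row per day), you could simply add yesterday's outcome as a predictor:

```r
## Lag-1 "Markov" term: yesterday's snow/no-snow outcome
## (the first day's lag is NA and is dropped by glm)
dat$snow_lag1 <- c(NA, head(dat$snow, -1))

fit_markov <- glm(snow ~ temp + pressure + wind + snow_lag1,
                  family = binomial, data = dat)
summary(fit_markov)
```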
There's also a glarma package designed for binomial and count-type outcomes in time series that might be useful, but I don't have experience with it.
Finally, I do wonder whether a binary snow/no-snow dichotomy will be helpful. You must have an arbitrary cutoff of whether a dusting of snow counts as a snow day, and there is no distinction between nuisance snowfalls of a few inches and major snow events. Might a more continuous measure (say, actual snowfall or its water equivalent) be better? That, of course, depends on how you want to use your model.
*What you have is better termed a "logistic multiple regression" rather than a "multivariate logistic regression." The current recommendation is to reserve the word "multivariate" for multiple outcomes rather than multiple predictors, although there is much inconsistency in practice.
**Also, with binary regression, there can be an omitted-variable bias that's more of a problem than in linear regression. In linear regression, you can have such bias if an omitted predictor is associated both with outcome and with the included predictors. In binomial regression, leaving out any outcome-associated predictor can lead to bias. I'm not sure how to think about that bias in the context of autocorrelation of errors.