
I posted a question, Models for fully correlated data, where I asked about regression modelling with correlated covariates. I had thought that in standard regression the covariates cannot be correlated (e.g. for the same patient, newer values of a variable x1 must be independent of older values of x1 for that same patient), because otherwise this results in multicollinearity.

Are there models which can handle correlated predictor/covariate variables? I take it that creating covariates from lagged versions of the same covariates, or from lagged versions of the response, is not allowed in GLM-style regressions? Apart from causing problems for the first observation for each patient (how can you have a lagged value for the first measurement?), this would presumably violate the model assumptions?

Can someone recommend statistical models for these kinds of situations?

  • What do you mean by “allowed”? You can calculate, say, the usual OLS estimate $\hat\beta_{OLS} = (X^TX)^{-1}X^Ty$ when you have correlated features. A drawback is an inflation of standard errors beyond what they would be, all else equal, if the features were uncorrelated, but this isn’t necessarily particularly problematic (see the sketch after these comments). Misconceptions abound when it comes to feature correlations, and you wouldn’t be the first person to think features should be independent. Perhaps read my community wiki answer here for my thoughts on the origin of such myths. – Dave Nov 25 '23 at 06:04
  • thanks! did you have an opinion about my previous question? –  Nov 25 '23 at 14:56
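To make the first comment concrete, here is a minimal numeric sketch (an illustration added for clarity, not code from the thread, assuming numpy and statsmodels are available): OLS is perfectly computable with highly correlated features, and the visible cost is wider standard errors on the individual coefficients.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 500

x1 = rng.normal(size=n)
x2_indep = rng.normal(size=n)                                    # uncorrelated with x1
x2_corr = 0.95 * x1 + np.sqrt(1 - 0.95**2) * rng.normal(size=n)  # corr(x1, x2) ~ 0.95

y_indep = 1.0 + 2.0 * x1 + 3.0 * x2_indep + rng.normal(size=n)
y_corr = 1.0 + 2.0 * x1 + 3.0 * x2_corr + rng.normal(size=n)

fit_indep = sm.OLS(y_indep, sm.add_constant(np.column_stack([x1, x2_indep]))).fit()
fit_corr = sm.OLS(y_corr, sm.add_constant(np.column_stack([x1, x2_corr]))).fit()

# Both models fit without any problem; the correlated design simply yields
# larger standard errors for the individual slopes.
print("SEs, uncorrelated design:", np.round(fit_indep.bse, 3))
print("SEs, correlated design:  ", np.round(fit_corr.bse, 3))
```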

1 Answer


A common myth in regression analysis is that explanatory or independent variables must be uncorrelated with each other. This is not true. While high correlation among predictors can be problematic, modest correlations do not usually pose significant issues, and with real-world observational data we would expect some correlation among the independent variables.

Very high multicollinearity can lead to inflated standard errors, making it difficult to determine the individual effect of each predictor. However, it does not affect the overall fit of the model or its predictive power. Very high multicollinearity is typically detected using variance inflation factors (VIFs) or condition indices. When encountered, it can be managed through techniques like principal component regression, ridge regression, or by dropping one of the correlated variables.
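As an illustrative sketch of the detection and mitigation steps just described (assuming statsmodels and scikit-learn; the variables and the VIF threshold are purely hypothetical), one might do something like:

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n = 300
x1 = rng.normal(size=n)
x2 = x1 + 0.05 * rng.normal(size=n)   # nearly a duplicate of x1
x3 = rng.normal(size=n)
X = np.column_stack([x1, x2, x3])
y = 1.0 + 2.0 * x1 + 0.5 * x3 + rng.normal(size=n)

# VIFs are computed per predictor after adding an intercept column;
# values above ~10 (some use ~5) are a common rule of thumb for trouble.
X_c = sm.add_constant(X)
vifs = [variance_inflation_factor(X_c, j) for j in range(1, X_c.shape[1])]
print("VIFs:", np.round(vifs, 1))     # x1 and x2 will show very large VIFs

# Ridge regression shrinks and stabilises the coefficients under collinearity.
ridge = Ridge(alpha=1.0).fit(X, y)
print("Ridge coefficients:", np.round(ridge.coef_, 2))
```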

If you have correlated data due to repeated measures, then mixed effects models (with an appropriate residual covariance structure such as AR(1), Toeplitz, or unstructured), Markov models, or generalised estimating equations can be a good choice.
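A rough sketch of the repeated-measures options mentioned above, using a synthetic longitudinal data frame since no data were given in the question (statsmodels' MixedLM and GEE; the variable names are illustrative only):

```python
import numpy as np
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

rng = np.random.default_rng(2)
n_patients, n_visits = 50, 4
patient = np.repeat(np.arange(n_patients), n_visits)
time = np.tile(np.arange(n_visits), n_patients)
patient_effect = np.repeat(rng.normal(scale=1.0, size=n_patients), n_visits)
y = 2.0 + 0.5 * time + patient_effect + rng.normal(scale=0.5, size=n_patients * n_visits)
df = pd.DataFrame({"patient": patient, "time": time, "y": y})

# Linear mixed model with a random intercept for each patient
mixed = smf.mixedlm("y ~ time", data=df, groups="patient").fit()
print(mixed.summary())

# GEE with an AR(1) working correlation structure within patients
gee = smf.gee("y ~ time", groups="patient", data=df,
              cov_struct=sm.cov_struct.Autoregressive(),
              family=sm.families.Gaussian()).fit()
print(gee.summary())
```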

Robert Long
  • thanks! did you have an opinion on my previous question? –  Nov 25 '23 at 14:54
  • "making it difficult to determine the individual effect of each predictor" -- this can happen not only because of the inflated standard error. Imagine two correlated variables which measure the same thing. In some models, the two may have different signs making it seem like one increases the expectation while the other decreases. That couldn't be the case in reality because the variables measure the same underlying variable, but it can happen due to high correlation! – Demetri Pananos Nov 25 '23 at 15:46
  • I've had a look at your other question but I don't really have much to add that isn't mentioned in the comments. I think that question could do with a bit more focus - describe the study design, and your research question more clearly (perhaps delete that question and ask a new one) – Robert Long Nov 25 '23 at 19:37
  • thanks! any ideas? –  Nov 27 '23 at 02:22
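A small simulation sketch of the sign-flip phenomenon Demetri Pananos describes above (illustrative only, not code from the thread): two noisy measurements of the same latent quantity are each positively related to the response, yet in any single fit their coefficients frequently come out with opposite signs.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n, n_sims, flips = 100, 200, 0

for _ in range(n_sims):
    latent = rng.normal(size=n)              # the single quantity both variables measure
    x1 = latent + 0.05 * rng.normal(size=n)  # two noisy measurements of `latent`
    x2 = latent + 0.05 * rng.normal(size=n)
    y = 1.0 + latent + rng.normal(size=n)
    b = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit().params
    flips += b[1] * b[2] < 0                 # slopes on x1 and x2 disagree in sign

print(f"Opposite signs in {flips} of {n_sims} simulated fits")
```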