
Suppose I have heteroscedastic data in which error terms increase for larger data points.

Assuming that either of these appears to fit the data well, which is the correct model to use, and why?

$Y = \beta X + \epsilon X$

or

$Y = \beta X + \epsilon Y$

where in both cases $\epsilon$ comes from $\mathcal{N}(\mu=1,\sigma)$

For prediction the second model is clearly less useful as the value of $Y$ is unknown, but if we are doing inference is there any advantage to it - such as, for example, the ability to use the fitted value of $\sigma$ as a measure of model performance?
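For concreteness, the first model can be simulated like this (a sketch with illustrative values $\beta = 2$, $\sigma = 0.5$; $\epsilon$ drawn with $\mu = 1$ as above):

```python
import numpy as np

rng = np.random.default_rng(0)
beta, sigma, n = 2.0, 0.5, 1000

x = rng.uniform(1, 10, n)
eps = rng.normal(1.0, sigma, n)   # eps ~ N(mu=1, sigma), as stated
y = beta * x + eps * x            # first model: Y = beta*X + eps*X

# The error term eps*X has standard deviation sigma*X, and since
# E[eps] = 1, E[Y] = (beta + 1)*X: the scatter grows linearly with X.
resid = y - (beta + 1.0) * x
low = resid[x < 4].std()
high = resid[x > 7].std()
print(f"residual sd for small X: {low:.2f}, for large X: {high:.2f}")
```

The residual spread for large X comes out several times that for small X, which is exactly the heteroscedasticity described above.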

Sideshow Bob
  • This doesn't directly answer your question so I've put it as a comment. I was always taught that to fix heteroscedastic data you should attempt to transform Y (e.g. a Box-Cox transformation) to get homoscedastic data and then fit your model to the transformed values of Y. This still enables predictions. – gowerc Mar 18 '18 at 21:40
  • Interesting, were you taught in that case to restructure the model - as transforming your data effectively restructures the model altogether if it is multivariate? – Sideshow Bob Mar 19 '18 at 07:38
  • Essentially yes. You now model the transformed value and then reverse the transformation to get your predicted values of Y. From anecdotal evidence, the explanatory variables in the model tend not to change, with it mostly just requiring a rescaling of the model coefficients. In a linear model this still enables significance testing of specific coefficients, but direct interpretation of their effects is much harder. – gowerc Mar 19 '18 at 07:56
  • Fair enough, though I think that approach might run into problems where your model is supposed to have a direct physical interpretation rather than just being an attempt to detect patterns in data. (This is the case with my current one although I admit I didn't say as much in the OP). – Sideshow Bob Mar 19 '18 at 08:34
  • The reason we are transforming Y and/or the X's is because we hope to get a plot of residuals versus fitted values which shows a constant level of scatter along the horizontal axis displaying the fitted values, thereby indicating the assumption of homoscedastic model errors is reasonable. Sometimes, even after transformation of Y and/or the X's, we still get the "funnel" effect in that plot. In those cases, we have no choice but to either model the heteroscedasticity of the model errors directly or ignore it but correct the standard errors of the estimated model coefficients for it. – Isabella Ghement Mar 19 '18 at 15:27
  • One possibility is to try a gamma GLM, which is appropriate when the coefficient of variation is constant. – kjetil b halvorsen Mar 02 '24 at 22:03
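The transform-then-back-transform route described in the comments can be sketched as follows (a minimal illustration on made-up data with multiplicative noise, using SciPy's `boxcox` and `inv_boxcox`):

```python
import numpy as np
from scipy.stats import boxcox
from scipy.special import inv_boxcox

rng = np.random.default_rng(1)
x = rng.uniform(1, 10, 500)
y = 2.0 * x * np.exp(rng.normal(0, 0.3, 500))  # multiplicative noise -> funnel

# Estimate the Box-Cox lambda that best normalizes Y, fit a line on the
# transformed scale, then reverse the transformation for predictions.
y_t, lam = boxcox(y)
b1, b0 = np.polyfit(x, y_t, 1)
pred = inv_boxcox(b0 + b1 * x, lam)
```

As gowerc notes, the coefficients `b0` and `b1` live on the transformed scale, so their direct interpretation is harder even though predictions on the original scale are recovered by the inverse transform.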

1 Answer


Once you fit the model that assumes homoscedastic errors to your data, you can check the plot of residuals versus fitted values. If that plot exhibits the "funnel" effect (i.e., increasing variability as you move along the horizontal axis corresponding to the fitted values), then you need to investigate what kind of "fix" is required to address it, since the funnel signals heteroscedastic model errors.
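A numeric stand-in for eyeballing that plot (a sketch on simulated data where the error sd grows with X): compare the residual spread in the lower and upper thirds of the fitted values.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(1, 10, 800)
y = 3.0 * x + rng.normal(0, 0.4, 800) * x  # error sd proportional to X

b1, b0 = np.polyfit(x, y, 1)               # ordinary (homoscedastic) fit
fitted = b0 + b1 * x
resid = y - fitted

# The "funnel": residual spread is larger where the fitted values are larger.
lo = resid[fitted < np.quantile(fitted, 1 / 3)].std()
hi = resid[fitted > np.quantile(fitted, 2 / 3)].std()
print(f"residual sd, low third: {lo:.2f}; high third: {hi:.2f}")
```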

If your model only includes one predictor, X, which is continuous, you can update the model to allow the error variability to increase with X. If X is categorical, you can allow different variabilities across categories.
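For the continuous-X case, one simple version of this is weighted least squares with $\operatorname{Var}(\epsilon_i) \propto x_i^2$. A sketch with NumPy (note that `np.polyfit` expects weights of $1/\text{sd}$, not $1/\text{variance}$):

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.uniform(1, 10, 500)
y = 3.0 * x + rng.normal(0, 0.5, 500) * x  # Var(error) proportional to x**2

# Weighted least squares: np.polyfit multiplies w into the residuals,
# so pass w = 1/sd = 1/x (not 1/x**2) to downweight noisy large-X points.
b1_wls, b0_wls = np.polyfit(x, y, 1, w=1.0 / x)
b1_ols, b0_ols = np.polyfit(x, y, 1)
print(f"WLS slope: {b1_wls:.3f}, OLS slope: {b1_ols:.3f}")
```

Both estimators recover the true slope of 3 here; the payoff of the weighted fit is more efficient estimates and standard errors that reflect the assumed variance structure.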

If your model includes multiple predictors, it might be easier to have the error variability depend on the fitted values.
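A two-step sketch of that idea, under the assumed variance function that the error sd is roughly linear in the fitted values (simulated data again):

```python
import numpy as np

rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 600)
y = 3.0 * x + rng.normal(0, 0.5, 600) * x

# Step 1: unweighted fit to get provisional fitted values.
b1, b0 = np.polyfit(x, y, 1)
fitted = b0 + b1 * x

# Step 2: estimate a rough sd-vs-fitted trend from the absolute residuals,
# then refit with weights 1/sd (clipped to keep the weights positive).
sd_trend = np.polyfit(fitted, np.abs(y - fitted), 1)
sd_hat = np.clip(np.polyval(sd_trend, fitted), 1e-6, None)
b1_w, b0_w = np.polyfit(x, y, 1, w=1.0 / sd_hat)
```

This is only a crude version of modelling the variance as a function of the fitted values; the comments' gamma-GLM suggestion is a more principled route when the coefficient of variation is constant.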

I have not seen situations where the variability depends on the values of Y, because usually we are aiming to model the conditional variability of Y given the predictor variables X, etc. For this reason, it makes more sense to have the variability depend on known quantities such as X or the fitted values.

Isabella Ghement