I am trying to fit a linear model where the dependent variable has the following summary statistics:

| Min. | 1st Qu. | Median | Mean | 3rd Qu. | Max. |
|---|---|---|---|---|---|
| 0.1579 | 0.3155 | 0.3547 | 0.3459 | 0.3827 | 0.4583 |

There were some notable outliers on both ends, so I used a robust linear regression. When training the model, I log-transformed both the dependent variable and the two main features, and the resulting predictions had the following skew:
After experimenting with a beta regression using the betareg package, the results look much better on the low end of the predictions, but the model still systematically under-predicts on the high end. Any suggestions?
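
For concreteness, here is a minimal sketch of the kind of robust log-log fit described above. It runs on simulated data, since the real data are not shown. The feature names `x1` and `x2`, the simulated coefficients, and the choice of `MASS::rlm` (Huber M-estimation) are all assumptions. The post does not say which robust estimator was actually used.

```r
library(MASS)  # provides rlm(); MASS is a recommended package bundled with R

set.seed(1)
n <- 500

# Simulated stand-in data: x1 and x2 are hypothetical feature names,
# and the coefficients below are made up for illustration only.
d <- data.frame(x1 = runif(n, 0.5, 2), x2 = runif(n, 0.5, 2))
d$y <- exp(-1.1 + 0.2 * log(d$x1) - 0.1 * log(d$x2) + rnorm(n, 0, 0.1))

# Robust fit (Huber M-estimation by default) on the log-log scale
fit <- rlm(log(y) ~ log(x1) + log(x2), data = d)

# predict() returns values on the log scale; exponentiate to get back
# to the original scale of y
pred <- exp(predict(fit))

# Compare the spread of the predictions with the spread of the observed values
summary(pred)
summary(d$y)
```

Comparing the two summaries is a quick way to check whether the predictions cover a narrower range than the observed values.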
As Dave noted below, here is a better representation of the problem I laid out: the model is systematically under-predicting on the high end.




R code demonstrates that. `set.seed(2023); N <- 1000; x <- rbeta(N, 5, 1); e <- rnorm(N, 0, 1); y <- 14*x + e; L <- lm(y ~ x); yhat <- predict(L); qqnorm(yhat); qqline(yhat)` The `y` here seems to have the same kind of skew as yours has. – Dave Nov 01 '23 at 16:53
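
One way to see why the simulation in the comment produces skewed predictions: with a single predictor, the fitted values are an affine function of `x`. When the slope is positive, they have exactly the same skewness as `x`. The sketch below reruns the same simulation and checks this numerically. The `skewness` helper is a hypothetical moment-based function written for this illustration and is not part of the original comment.

```r
# Hypothetical helper: moment-based sample skewness
skewness <- function(v) {
  d <- v - mean(v)
  mean(d^3) / mean(d^2)^1.5
}

# Same simulation as in the comment
set.seed(2023)
N <- 1000
x <- rbeta(N, 5, 1)   # left-skewed predictor
e <- rnorm(N, 0, 1)   # symmetric noise
y <- 14 * x + e
L <- lm(y ~ x)
yhat <- predict(L)

# yhat = intercept + slope * x, so its skewness matches that of x
skew_x <- skewness(x)
skew_yhat <- skewness(yhat)
c(x = skew_x, yhat = skew_yhat)
```

In this setup the skew in the Q-Q plot of `yhat` comes from the distribution of the predictor, even though the linear model is correctly specified.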