
I'm training a fairly standard LightGBM regressor and noticing a strange pattern in the residuals (see the images below; I'm binning the predicted values and taking the average observed value within each bin). On observations with high fitted values, the model consistently underestimates the observed response, and on observations with low fitted values it consistently overestimates it. My questions are: why would this pattern emerge, and what can I do to mitigate it (beyond applying a post-hoc adjustment)?
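
For reference, here's roughly the binned diagnostic I'm describing (a minimal sketch; the array names and bin count are placeholders, not my exact code):

```python
import pandas as pd

def binned_calibration(y_pred, y_obs, n_bins=20):
    """Bucket observations by predicted value and compare the mean
    prediction to the mean observed response within each bucket."""
    df = pd.DataFrame({"pred": y_pred, "obs": y_obs})
    # Quantile bins so each bucket holds roughly the same number of rows
    df["bucket"] = pd.qcut(df["pred"], q=n_bins, duplicates="drop")
    return df.groupby("bucket", observed=True).agg(
        mean_pred=("pred", "mean"),
        mean_obs=("obs", "mean"),
        count=("obs", "size"),
    )
```

The plots below are just `mean_obs` plotted against `mean_pred` from a table like this.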

As for why the pattern emerges, I'm confused why the model wouldn't just take the observations it's already predicting to be particularly high or low and push them to be more extreme. I saw responses to a similar question suggesting this can happen when there is a lot of unexplained variance in the target variable. That may be the case here, but I would still think the model could reduce its loss by pushing its current predictions further toward the extremes. Is that perhaps not possible?

A few more details

  • I've experimented with a bunch of different hyperparameters and the same pattern keeps appearing. The results below come from a model with learning_rate=0.01, n_estimators=1000, num_leaves=63, subsample=0.8, subsample_freq=5, reg_lambda=0 (a sketch of this configuration follows the list).
  • There are a few million observations in both the training and test sets. The target variable is roughly normally distributed, but about one in every couple hundred observations is a huge positive outlier (5-10 standard deviations above the mean).
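
For concreteness, the setup looks roughly like this (a sketch only; `X_train`/`y_train` stand in for my actual data, and everything not listed is left at its default):

```python
from lightgbm import LGBMRegressor

model = LGBMRegressor(
    objective="regression",   # default squared-error loss
    learning_rate=0.01,
    n_estimators=1000,
    num_leaves=63,
    subsample=0.8,
    subsample_freq=5,
    reg_lambda=0.0,
)
# model.fit(X_train, y_train)
# y_pred = model.predict(X_test)
```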

[Two plots: average observed response vs. binned predicted values, showing the over/underestimation pattern described above]

  • Just to be clear, my question isn't why the predictions have less variance than the observed values (I know the observed values have much greater variance), but rather why the expected value of the observed response differs from the prediction in this systematic pattern – dfried Dec 14 '23 at 02:59

0 Answers