
I've seen similar questions on here, but none seem to quite apply to my use case.

I want to predict Metacritic scores based on a number of features. Metacritic scores are bounded to a 0-100 scale, but using scikit-learn's GradientBoostingRegressor I get predictions outside this range (i.e. <0 or >100), despite a solid R² score of 0.89. How can I prevent this behaviour? I could simply bound all outputs to that scale after prediction (something as simple as y_pred[y_pred > 100] = 100), but that seems like cheating. And while doing so does improve the scores, the improvement is negligible.
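For concreteness, the post-hoc clamp amounts to a one-line `np.clip` (a minimal sketch with made-up prediction values):

```python
import numpy as np

# hypothetical raw regressor outputs, some falling outside the valid range
y_pred = np.array([-3.2, 47.5, 101.8])

# clamp both tails into the [0, 100] score range after prediction
y_pred = np.clip(y_pred, 0, 100)
print(y_pred)  # [  0.   47.5 100. ]
```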

Ideally I'd like to incorporate this constraint into the model itself, since it's such an obvious property of the target, but I can't seem to find out whether that is possible, and if so, how.

– Readler
  • Why is predicting scores above 100 or below 0 a problem? Why is clamping the scores outside of the [0, 100] interval a bad solution? In what sense is it "cheating," and why is "cheating" in this sense a bad thing? – Sycorax Sep 03 '20 at 15:58
  • @Sycorax My intuition is that those are obviously wrong predictions and therefore problematic. It is also intuitively "cheating" because I am thinking of the regressor as a function which we can stretch and compress. I thought that by manually setting the bounds, we might also be able to change all the other output values. The closest thing to that would be to normalise to the [0,100] interval, I suppose, though I was hoping for a solution that applies before the predictions are produced. – Readler Sep 03 '20 at 16:07
  • You could divide your target by 100 and follow the procedure suggested here: https://stats.stackexchange.com/questions/204154/classification-with-gradient-boosting-how-to-keep-the-prediction-in-0-1 (a sketch of this approach appears after the comment thread). Is there a reason this does not suit your needs? – Sycorax Sep 03 '20 at 16:21
  • If you are willing to take a probabilistic approach (or even a fully Bayesian one), a natural way to bound your predictions is to use a sensibly bounded likelihood, for example a truncated distribution (a second sketch after the thread illustrates this). Sometimes working with these probabilistic models lets you encode information about how things behave near the boundaries and improve the predictions in that region. If the large majority of the data is in the interior, then your predictions likely will be too, and this might not matter much, i.e. it won't be much different from bounding the predictions after the fact. – Tyrel Stokes Sep 03 '20 at 16:42
  • The response is bounded by 0 and 100, but the predictions aren't? – Michael M Sep 03 '20 at 17:14
  • @Sycorax I struggle to understand the solution provided in your post. Would you mind explaining it a bit further? – Readler Sep 03 '20 at 17:46
  • @TyrelStokes That would require completely starting from scratch, no? – Readler Sep 03 '20 at 17:46
  • @MichaelM The target variable is bounded by [0,100], yet the predictions go beyond the bounds. – Readler Sep 03 '20 at 17:46
  • I doubt this is a bug -- on its face, this appears to be the consequence of using an unbounded function to predict bounded quantities, similar to using OLS to estimate a binary response. This also suggests the path to a solution: use some transformation to constrain the predictions to the correct range. Writing the loss function as the log-likelihood of a particular probability model is one way to do this. – Sycorax Sep 03 '20 at 18:03
  • How can a combination of decision trees predict outside the observed range? I have never seen such a thing. – Michael M Sep 03 '20 at 19:03
  • @MichaelM GradientBoostingRegressor is a boosted decision tree. The boosting weights can be any number. The sum $S$ of the trees' predictions gives the model's prediction. You could transform $S$ to bound it, and that's basically the content of the question: which transformations make sense for this task. – Sycorax Sep 03 '20 at 19:57
  • I should have said that "GradientBoostingRegressor is a boosted decision tree ensemble", but the point remains. – Sycorax Sep 03 '20 at 20:07
  • Hmmm, I slowly start to understand, thanks! – Michael M Sep 03 '20 at 20:10
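
For later readers, here is a minimal sketch of the transform-the-target procedure Sycorax links above: map the 0-100 score through a logit so the regressor trains on an unbounded scale, then invert with a sigmoid at prediction time, which forces predictions into (0, 100). The helper names and the `eps` clamp (which keeps scores of exactly 0 or 100 finite under the logit) are my own assumptions, not part of the linked answer:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

def fit_bounded(X, y, lo=0.0, hi=100.0, eps=1e-3):
    """Fit on the logit scale; predictions can later be squashed back."""
    p = np.clip((y - lo) / (hi - lo), eps, 1 - eps)  # [lo, hi] -> (0, 1)
    z = np.log(p / (1 - p))                          # (0, 1) -> real line
    return GradientBoostingRegressor().fit(X, z)

def predict_bounded(model, X, lo=0.0, hi=100.0):
    """Sigmoid squashes raw predictions into (0, 1); rescale to (lo, hi)."""
    z = model.predict(X)
    return lo + (hi - lo) / (1.0 + np.exp(-z))
```

The same transform can be packaged with scikit-learn's `TransformedTargetRegressor`, using `scipy.special.logit` and `scipy.special.expit` as `func` and `inverse_func` on the rescaled target.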
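And a minimal sketch of Tyrel Stokes's truncated-distribution idea, under an assumption I am adding for illustration: treat a regressor's raw output as the location of a normal truncated to [0, 100] with a fixed, hypothetical noise scale, and report the truncated distribution's mean, which always falls strictly inside the interval. A full probabilistic treatment would estimate the scale from data rather than fixing it:

```python
import numpy as np
from scipy.stats import truncnorm

def truncated_mean(mu, sigma=10.0, lo=0.0, hi=100.0):
    # scipy parametrizes the truncation bounds in standardized units
    a, b = (lo - mu) / sigma, (hi - mu) / sigma
    return truncnorm.mean(a, b, loc=mu, scale=sigma)

raw = [-5.0, 50.0, 104.0]                # unbounded regressor outputs
print([truncated_mean(m) for m in raw])  # every value lands strictly inside (0, 100)
```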