
I'm fitting SciKit-Learn's KNeighborsRegressor on a 5 dimensional space and my model performance is peaking at a score of $\sim 0$.

In their documentation they say that the score they're using is the following:

$$R^2 = (1 - \frac{u}{v})$$

which I believe is the formula for the Coefficient of Determination, making:

$$u = \sum^N_i(y_{t,i} - y_{p,i})^2, \qquad v=\sum^N_i(y_{t,i} - \bar y_{t})^2$$

where $N$ is the number of samples, $y_{p, i}$ is the predicted value of sample $i$, $y_{t, i}$ is the true value of sample $i$, and $\bar y_{t}$ is the mean of the true values.
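These two sums can be checked by hand against scikit-learn's `r2_score` (the same metric that `KNeighborsRegressor.score` reports). A minimal sketch, using made-up numbers purely for illustration:

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative data (made up for this sketch)
y_true = np.array([3.0, -0.5, 2.0, 7.0])
y_pred = np.array([2.5, 0.0, 2.0, 8.0])

u = np.sum((y_true - y_pred) ** 2)         # residual sum of squares
v = np.sum((y_true - y_true.mean()) ** 2)  # total sum of squares
r2_manual = 1 - u / v

# Matches sklearn's implementation
assert np.isclose(r2_manual, r2_score(y_true, y_pred))
```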

This makes $u/v$ the fraction of unexplained variance in the dataset, and $R^2$ one minus that fraction. I'm struggling to understand what that means for my model's performance, and the SciKit-Learn documentation isn't much help:

> The best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse). A constant model that always predicts the expected value of y, disregarding the input features, would get a $R^2$ score of 0.0.

When measuring model performance using the Coefficient of Determination, $R^2$, what is a good score?

Connor
    Would it be reasonable to ditch the [tag:sklearn] tag in favor of [tag:r-squared]? That the software being used is sklearn seems neither here nor there. – Dave Mar 22 '23 at 13:16

1 Answer


In general, it is hard to say what constitutes a good score. I have seen papers in top journals that have $R^2<0.1$. At the same time, it might be the case for a different problem that $R^2=0.9$ is nothing worth celebrating.

An advantage that $R^2$ has over other measures of performance is that it inherently gives some kind of comparison to a baseline model; if you can’t beat the baseline model, your model isn’t helping. This corresponds to $R^2\le 0$, with equality denoting the exact same performance as the baseline model.
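This baseline is exactly what scikit-learn's `DummyRegressor` implements: predict the mean of `y` everywhere. A short sketch (with randomly generated data, purely for illustration) showing that this constant predictor scores $R^2=0$ on the data it was fit to, and that a worse constant guess goes negative:

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 5 features, as in the question
y = rng.normal(size=100)

# Baseline: always predict the training mean of y
baseline = DummyRegressor(strategy="mean").fit(X, y)
print(r2_score(y, baseline.predict(X)))  # 0.0 by construction

# A constant prediction other than the mean does strictly worse
worse = np.full_like(y, y.mean() + 1.0)
print(r2_score(y, worse))  # negative
```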

Thus, as long as you get $R^2>0$, you’re doing something useful in terms of predictive ability. At the very least, you are performing better than a reasonable baseline model. How much better than $0$ you need to be is going to depend on the problem and how others have performed. If you get the best value anyone has ever gotten, that sounds like good news! If you beat the baseline model but fall short of what most others can achieve, there is room to improve.

As with any machine learning task, watch out for overfitting.

Dave
  • Ahhhh, I get it. So the score is basically saying how well you did vs how well you would do if you guessed the target's mean for everything? In which case, a score of 0 is pretty bad. Thank you! Massive help. – Connor Jan 11 '23 at 00:28
    @Connor That’s exactly what this formula means! // A score of $R^2=0$ is pretty bad, yes, but $R^2$ can go negative, depending on what you’re doing, and that’s even worse! – Dave Jan 11 '23 at 00:29
  • What do you think of the answer to this question: https://stats.stackexchange.com/questions/99171/why-is-euclidean-distance-not-a-good-metric-in-high-dimensions? It seems like this might have some bearing on the issue at hand here! – Connor Jan 11 '23 at 22:15
  • @Connor What do you see as the connection to your question? // I do see some connections, but I appear to be in the minority. – Dave Jan 11 '23 at 22:17
  • Specifically, the part where they say that in higher dimensions all the points tend to spread out so much that there's little difference between their neighbour distances. This would give me exactly the results I'm getting. If the neighbour distances converge to some average, then you would expect the model to reproduce that average and score 0. – Connor Jan 11 '23 at 22:41
  • @Connor If you have some thoughts on the question of mine I linked, I absolutely welcome a post over there. – Dave Feb 11 '23 at 16:52