
I am trying to predict a value using a linear regression, and I get an R squared of 0.63. My data is composed of 5 different groups (each with different characteristics). When I take the model's predictions and calculate the R squared value for each group, I get a negative R squared for some of them, and a much lower R squared for all of them:

group A: -0.71
group B: -0.04
group C: 0.16
group D: 0.19
group E: -0.05

I wonder whether this is a good way of assessing the model's performance over the subgroups, or am I doing something wrong here?

David
  • You are doing something wrong here. – user2974951 May 18 '22 at 10:01
  • How are you calculating $R^2$ of the subgroups? – Dave May 18 '22 at 10:01
  • I am taking the predictions (y_pred) and the real values, and using sklearn.metrics.r2_score(y_test, y_pred) on the full set and on each subgroup. – David May 18 '22 at 10:08
  • So these are out-of-sample values? What about the $0.63?$ // What do you want this $R^2$ value to tell you? Please be as specific as you can be. – Dave May 18 '22 at 10:22
  • These are out of sample. The 0.63 is on all of the out-of-sample data, while the others are on the out-of-sample data for each subgroup. I want to get a feeling for the model's performance on the subgroups. – David May 18 '22 at 10:24
  • To get the R2 for each subgroup, you have to check the RSS conditional on that group, not the RSS of the whole sample with the coefficient of only that group. I second @Dave that you should be more specific in your question. – Olivier Hubert May 18 '22 at 10:37
  • I am calculating the RSS per group (with its own mean). Does that make sense? – David May 18 '22 at 10:44
  • 1
    This is Simpson's paradox in practice – Firebug Mar 22 '23 at 15:26
  • @Firebug I tried to allude to that in my answer without mentioning it by name. Since I am not sure that I have a handle on all of the details (hence my answer not addressing it by name), I would be interested in reading an answer that fleshes out this notion. – Dave Mar 22 '23 at 15:31

1 Answer


How you perform and calculate such a statistic depends on what you want to learn from it.

My belief about $R^2$ is that it is a comparison of how your model performs (in terms of square loss) vs how a naïve model performs when it just predicts the mean every time. With this in mind, there are two possibilities for calculating a subgroup $R^2$.

  1. Calculate the usual $R^2=1-\left(\dfrac{ \overset{N_{group}}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N_{group}}{\underset{i=1}{\sum}}\left( y_i-\bar y \right)^2 }\right) $, restricted to the $N_{group}$ points in the group but keeping the overall mean $\bar y$ in the denominator.

  2. Calculate $R^2=1-\left(\dfrac{ \overset{N_{group}}{\underset{i=1}{\sum}}\left( y_i-\hat y_i \right)^2 }{ \overset{N_{group}}{\underset{i=1}{\sum}}\left( y_i-\bar y_{group} \right)^2 }\right) $, using the mean of just that particular group.

Since you know the group membership, it seems legitimate to use the second formula, which will tell you how your model performs on that one group compared to how you would do if you predicted the mean of that group every time.
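
A minimal sketch of the two calculations (assuming NumPy arrays `y_all` and `y_pred_all` plus a boolean `group_mask`; the names are illustrative, not from the question):

```python
import numpy as np

def subgroup_r2(y_all, y_pred_all, group_mask):
    """Return both subgroup R^2 variants for the points where group_mask is True."""
    y_g = y_all[group_mask]          # actual values in the group
    yhat_g = y_pred_all[group_mask]  # model predictions for the group

    rss = np.sum((y_g - yhat_g) ** 2)                # model's squared loss on the group
    tss_overall = np.sum((y_g - y_all.mean()) ** 2)  # baseline: predict the overall mean
    tss_group = np.sum((y_g - y_g.mean()) ** 2)      # baseline: predict the group mean

    return 1 - rss / tss_overall, 1 - rss / tss_group
```

The first number answers "how much better is the model than always predicting the overall mean, judged on this group?", while the second compares against always predicting the group's own mean.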

When you use the sklearn implementation as you do, I believe you get this second statistic: r2_score bases its denominator on the mean of whatever y_true values you pass it, so restricting it to one group uses that group's mean. There is an issue about using the in-sample vs. out-of-sample mean in the sklearn implementation, but these values are (hopefully) quite close and will give similar results.
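
You can check this equivalence directly on toy data (the arrays here are stand-ins, not your data):

```python
import numpy as np
from sklearn.metrics import r2_score

rng = np.random.default_rng(0)
y_g = rng.normal(size=20)                      # stand-in for one group's actual values
yhat_g = y_g + rng.normal(scale=0.5, size=20)  # stand-in for the model's predictions

# Second formula: denominator uses the group's own (out-of-sample) mean
manual = 1 - np.sum((y_g - yhat_g) ** 2) / np.sum((y_g - y_g.mean()) ** 2)

print(np.isclose(manual, r2_score(y_g, yhat_g)))  # True
```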

Your results tell you that, while the model overall does a better job of predicting (in terms of square loss) than predicting the overall mean every time, for some groups you would be better off predicting that group's mean than using your model's predictions.

I will venture a guess that, if you run a regression on just the group indicator variables (giving a model that predicts each group's mean), you will get a lower out-of-sample MSE than your existing model achieves. If you have many more instances of groups $C$ and $D$, the ones with positive "grouped" $R^2$, than of the other groups, this might not hold, but if the groups are roughly balanced, this is my prediction. You seem to do a better job of predicting by using the group means than by using your model predictions.
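
One way to check this guess is to compare out-of-sample MSEs directly. A sketch, assuming pandas DataFrames `df_train` and `df_test` with a `group` column and target `y`, and `y_pred` holding your existing model's test-set predictions (all hypothetical names); predicting each group's training mean is equivalent to regressing on the group indicators alone:

```python
import pandas as pd
from sklearn.metrics import mean_squared_error

group_means = df_train.groupby("group")["y"].mean()  # group-indicator regression = group means
baseline_pred = df_test["group"].map(group_means)    # predict each test row's group mean

print("group-means MSE:", mean_squared_error(df_test["y"], baseline_pred))
print("your model MSE: ", mean_squared_error(df_test["y"], y_pred))
```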

(If you take the stance that $R^2$ measures the proportion of explained variance, then by limiting your analysis to just one group, you are cutting down the variance, so of course a smaller proportion of that variance is explained. There are caveats here, since such an interpretation of $R^2$ need not apply, but this might give you some intuition about why your grouped $R^2$ values are lower than your overall $R^2$ value.)

Dave