
I have an XGBoost regression model that predicts a numeric target y. y is quite right-skewed when I plot its histogram; for example, the top 5% of observations accounts for 50% of the total sum. I do not want to filter them out, because they are important for my analysis, and besides, there is no fixed cut-off point.

When I train the model using XGBRegressor, which minimizes the squared error, I get an R2 of 0.76. I am curious about how well the model fits different target values, so I use sklearn's score method to calculate R2 and see the predictive power of my model on different subsets. I split the test data based on target-value percentiles (y < i-th percentile vs y >= i-th percentile). Note that R2_high (R2_low) refers to the subset of the test data where target values are higher (lower) than the percentile value. See the results below:

Percentile   R2_high   R2_low    R2_overall
0.10         0.758     -35.9     0.76
0.20         0.75      -15.4     0.76
0.50         0.73      -2.54     0.76
0.90         0.63      0.375     0.76
0.95         0.56      0.53      0.76
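To make the setup concrete, the percentile split described above can be sketched as follows (the helper name `split_r2` is my own, and `y_true`/`y_pred` stand in for the test targets and the model's predictions):

```python
import numpy as np
from sklearn.metrics import r2_score

def split_r2(y_true, y_pred, pct):
    """R^2 on the subsets below vs. at-or-above the pct-th percentile of y_true.

    Returns (R2_high, R2_low, R2_overall), mirroring the table above.
    """
    thresh = np.percentile(y_true, pct * 100)
    low = y_true < thresh
    return (r2_score(y_true[~low], y_pred[~low]),
            r2_score(y_true[low], y_pred[low]),
            r2_score(y_true, y_pred))
```

Note that r2_score recomputes its baseline mean from whichever subset it is handed, which turns out to matter a great deal for interpreting the numbers in the table.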
  1. Overall, I am satisfied with an R2 of 0.76. I would expect the R2 not to deviate much across subsets as long as each has enough data points (the test set has ~100,000 rows), yet it deviates extremely. Is this normal?
  2. It is reasonable that the model may fit the large values more closely, given the objective of minimizing squared error. However, that is not quite what happens: for the top 5% the R2 is only 0.56.
  3. For some subsets the R2 is terrible, with large negative values.
  4. Does training separate models make sense? For example, training 10 models, where dataset 1 contains the top 10% of y, dataset 2 the second 10%, and so on?

I would appreciate any help or suggestion greatly.

volkan g
  • What does $R^2$ mean when you compare to a different value? It might help to write out the equation. – Dave Oct 25 '21 at 15:34
  • I am not sure if I understood what you mean @Dave. – volkan g Oct 25 '21 at 16:01
  • "Note that R2_high (low) refers to the subset of the test data where target values are higher (lower) than the percentile value. See below results:" Please explain what this means. – Dave Oct 25 '21 at 16:04
  • Ah ok. Well how it is calculated is given here: https://scikit-learn.org/stable/modules/model_evaluation.html#r2-score See section: 3.3.4.8. R² score, the coefficient of determination – volkan g Oct 25 '21 at 16:06
  • What do you mean that the target values are higher than the percentile value? – Dave Oct 25 '21 at 16:08
  • Say 10th percentile of all target values in the test set is 12. I divide the test set into two: Lower part is where target value<12, Higher part is where target value>12. I calculate R2 separately for the two sets. – volkan g Oct 25 '21 at 16:27
  • If you're only predicting the values that are observed to be small, and you guess the mean, then you should expect your results to be awful. More concretely, if I want to predict the stock market losses when we have huge losses, I'm going to do poorly by guessing that market's average. It does not make sense to partition your data in this way. Where did you get this idea? – Dave Oct 25 '21 at 16:30

1 Answer


This evaluation scheme seems iffy to me, as it requires you to condition on the outcome, which is the very quantity you want to predict (and will not always know).

Sure, sklearn can handle the calculations just fine. When you plug the predicted and true values into r2_score, you get a calculation based on the following.

$$ R^2_{\text{sklearn}} = 1 - \frac{\sum_{i=1}^{N}\left( y_i - \hat y_i \right)^2}{\sum_{i=1}^{N}\left( y_i - \bar y \right)^2} $$

This $\bar y$ is the mean of the values you have given the r2_score function, that is, the values in the percentile range of interest.

The interpretation of this is that you compare the square loss of your model on the input data to the square loss of a model that predicts the same value $(\bar y)$ every time.
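A tiny made-up example shows the consequence: the same predictions can score well overall yet negatively on the low subset, because the subset's own mean becomes the baseline in the denominator.

```python
import numpy as np
from sklearn.metrics import r2_score

# Illustrative data: a right-skewed target and predictions that track the
# big values well but flatten out on the small ones.
y_true = np.array([1.0, 2.0, 3.0, 4.0, 50.0, 60.0])
y_pred = np.array([3.0, 3.0, 3.0, 3.0, 45.0, 62.0])

print(r2_score(y_true, y_pred))  # high overall (about 0.99)

# On the low subset alone, the baseline is the mean of [1, 2, 3, 4], so the
# same absolute errors now exceed the subset's own variance:
low = y_true < 10
print(r2_score(y_true[low], y_pred[low]))  # -0.2
```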

I find this baseline of predicting $\bar y$ every time questionable, however. If you compute it on in-sample data, you are not accounting for the possibility of overfitting. If you compute it on out-of-sample data, it is not clear what $\bar y$ should be, since you are not supposed to know the true outcomes and therefore cannot calculate their mean.

Put yourself in the position of having data where you truly do not know the outcome. I do not see a reasonable way to calculate the denominator. At least for a more standard out-of-sample $R^2$-style statistic, you can take $\bar y$ as the mean of the training data. Once you start applying restrictions on the values of $y$, however, the interpretation gets iffy.
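A minimal sketch of that more standard variant, with the baseline mean taken from the training data rather than from the evaluated subset (the function name `oos_r2` is my own):

```python
import numpy as np

def oos_r2(y_test, y_pred, y_train_mean):
    """Out-of-sample R^2: the baseline model predicts the *training* mean
    everywhere, so no test-set outcomes are needed to define it."""
    ss_res = np.sum((y_test - y_pred) ** 2)
    ss_base = np.sum((y_test - y_train_mean) ** 2)
    return 1.0 - ss_res / ss_base
```

With this definition, a model that merely matches the naive guess of the training mean scores 0, and anything worse scores negative, without ever conditioning on the test outcomes.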

Dave