Are there any sanity checks for high R2 value?

Question

I am training a RandomForestRegressor Model with Scikit-Learn to model a physical process. The dataset has the following properties:

450’000 samples, 42 features
Train test split of 80/20

On the test set I get a very high R2 of 0.985 and an RMSE of 0.71. I visualized the results in the plot below (y_pred vs. y_test):

[UPDATE1] added figure with s=1 and alpha = 0.1

[UPDATE2] added histogram with residuals

I am a bit suspicious of the high test R2 value. Even though the plot indicates a relatively good fit, the test and predicted values don’t match “perfectly”. I don’t have much intuition about what to expect from the R2 value in this case.

Does anyone have an idea whether such a high R2 is plausible? Are there any sanity checks (in addition to using a test set) that I could apply to verify my results? Thank you!

It all comes down to the nature of the data and what you know or do not know about the data generation process. In many fields $R^2 \ll 1$ is a sign of incompetence (e.g. where the issue is measurement or sampling error in experiments and only very high predictability is acceptable). In others $R^2 \sim 1$ signals tautology or fraud or out-of-control overfitting, as inherent predictability is very low. In social science it would be scary if knowing gender, race and some categories were more than weakly predictive of any kind of outcome because so much comes down to individual-level detail. — Nick Cox, May 12 '20 at 10:02
Your plot is not too informative. How many out of these 450’000 observations do in fact have high standardized residuals? The plot may mask that these are just a very small portion since the points in your plot overlap. If you used CV to find the best parameter combination, then you need an independent, fresh data set to estimate oos-prediction accurately. — 00schneider, May 12 '20 at 10:12
Judging 90,000 points in a plot is difficult due to overplotting, as 00schneider has pointed out. For visual inspection you should make at least two more plots: the same as the one above but with very small dots and in a very transparent color (set alpha near zero), and you should calculate the residuals (prediction errors) and plot a histogram or density plot of how those are distributed. — Bernhard, May 12 '20 at 10:29
@Nick Cox, thank you for your comment! I am trying to model a physical process, so I assume it is realistic to have a relatively high R2 value. However, I would not have expected a value of 0.99 (considering noise in the data, etc.) — lux7, May 12 '20 at 11:16
Thanks 00schneider and Bernhard, that is a very good point! I updated the question with the two plots as you described. — lux7, May 12 '20 at 11:28
@00schneider, so far I have just done a train/test split of the data, standardized the values (using only the training set) and fit the model on the training set without hyperparameter tuning. Finally, I evaluated the model on the test set. — lux7, May 12 '20 at 11:31
If you do not trust your model, I think it is a good idea to check its predictive quality on other, independent data sets. — 00schneider, May 12 '20 at 11:37
When fit is very good a plot like this is quite reassuring, and good propaganda, but it is best to switch to a plot of residual versus predicted or fitted, to see better the structure the model is not capturing. — Nick Cox, May 12 '20 at 12:29
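
The checks suggested in the comments above (share of large standardized residuals, structure in residuals versus predictions) can be put into numbers before plotting anything. The following is only a minimal sketch: `residual_summary` and its `threshold` parameter are hypothetical names, and the data are synthetic stand-ins chosen to roughly mimic the reported test size, R2 and RMSE, not the asker's data.

```python
import numpy as np

def residual_summary(y_true, y_pred, threshold=3.0):
    """Summarize residuals; `threshold` is in residual standard deviations."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    resid = y_true - y_pred
    std = resid.std()
    # Standardized residuals (all zeros if the fit is exact).
    z = (resid - resid.mean()) / std if std > 0 else np.zeros_like(resid)
    ss_res = np.sum(resid ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return {
        "r2": 1.0 - ss_res / ss_tot,
        "rmse": np.sqrt(np.mean(resid ** 2)),
        # Share of points with |standardized residual| above the threshold.
        "frac_large": np.mean(np.abs(z) > threshold),
        # Should be near zero if no linear structure is left in the residuals.
        "corr_resid_pred": np.corrcoef(resid, y_pred)[0, 1],
    }

# Synthetic stand-in (assumption): predictions plus independent noise.
rng = np.random.default_rng(0)
y_pred = rng.normal(0.0, 5.8, size=90_000)
y_test = y_pred + rng.normal(0.0, 0.71, size=y_pred.size)
summary = residual_summary(y_test, y_pred)
```

With real predictions, a large `frac_large` or a `corr_resid_pred` clearly away from zero would be a reason to look at the residual-versus-predicted plot more closely. A correlation near zero does not rule out nonlinear structure, so this complements the plots rather than replacing them.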

score 1 · Answer 1 · answered May 12 '20 at 11:36


Nick Cox made a very good point in his comment: for some data sets it is easy, and expected, to get good performance.

If, however, domain knowledge tells you that the features should not be that predictive of the target, you may suspect that some "cheating" is happening in the process and should check for data leakage.

For example, the training and test data may overlap, or there may be very strong "cheating" features.

A related post can be found here, and the answer in that post may be very helpful in your case.

How can I quickly detect cheating variables in large data?
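
The two leakage examples above can be checked roughly as follows. This is only a sketch under stated assumptions: the function names `count_test_rows_in_train` and `single_feature_r2` are hypothetical, the data are synthetic, and the squared Pearson correlation only detects a linear "cheating" feature.

```python
import numpy as np

def count_test_rows_in_train(X_train, X_test, decimals=8):
    """Count test rows that also appear (after rounding) in the training set."""
    # Adding 0.0 turns -0.0 into 0.0 so that equal rows have equal bytes.
    train = np.round(np.asarray(X_train, dtype=float), decimals) + 0.0
    test = np.round(np.asarray(X_test, dtype=float), decimals) + 0.0
    train_rows = {row.tobytes() for row in train}
    return sum(row.tobytes() in train_rows for row in test)

def single_feature_r2(X, y):
    """Squared Pearson correlation of each feature with the target."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    num = (Xc * yc[:, None]).sum(axis=0)
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    # Constant features (denom == 0) get a score of 0 instead of NaN.
    r = np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)
    return r ** 2

# Synthetic stand-in (assumption): column 4 is almost a copy of the target,
# and 10 training rows are deliberately copied into the test set.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = 2.0 * X[:, 0] + rng.normal(scale=0.5, size=1000)
X[:, 4] = y + rng.normal(scale=0.01, size=1000)
X_train, X_test = X[:800], X[800:].copy()
X_test[:10] = X_train[:10]

n_overlap = count_test_rows_in_train(X_train, X_test)
scores = single_feature_r2(X_train, y[:800])
```

A nonzero `n_overlap`, or a single feature whose score is close to the full model's R2, would be worth investigating. Neither check proves leakage, and a physically meaningful predictor can legitimately score high.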


Haitao Du


Beware: You don't need to think of "cheating" in some legal sense. You could just as well be cheating yourself inadvertently. – kjetil b halvorsen May 12 '20 at 13:33
Overfitting and sheer dopeyness -- such as including a predictor that is just the outcome, more or less -- are immensely more common in my experience than statistical dishonesty. Thanks for the compliment on my comment, but I was being a little dramatic in mentioning fraud -- although clearly it does exist. – Nick Cox May 12 '20 at 13:37
