
This question comes from reviewing a paper that has already been published. The authors report $R^2$ and RMSE on the training data but only RMSE on the validation data. Using the published code, $R^2$ can be calculated on the validation data, and it is in fact negative in every case, while the RMSE matches what is published. The task is regression rather than classification. There are roughly $45$ test cases using $2$ separate models (RF, ANN), meaning around $90$ fitted models and $90$ sets of predictions. Only $2$-$3$ of the $90$ predictions have a positive $R^2$ value, and those are all below $0.1$!

I am trying to convince my team that the results are poor, but they want to ignore the $R^2$ findings and argue that a "good" RMSE is enough. The RMSE does look okay, but based on a hunch (the negative $R^2$) I built two additional baseline models (mean and last sample) which often match or beat the RMSE of the RF and ANN models published in the paper. The mean model just takes the mean of the training data and uses that for every prediction. The dataset is a time series (time-varying, usually $1$-$2$ samples per week), so the last-sample model just uses the previous sample's value.
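To make the comparison concrete, here is roughly what the two baselines look like in code; the arrays below are toy stand-ins, not the paper's data:

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Toy stand-ins for the real series; the actual values come from the paper's data.
y_train = np.array([3.1, 2.9, 3.4, 3.0, 3.2])
y_test = np.array([3.3, 3.1, 2.8])

# Mean model: predict the training mean for every test point.
mean_pred = np.full_like(y_test, y_train.mean())

# Last-sample ("persistence") model: predict the previous observation.
last_pred = np.concatenate(([y_train[-1]], y_test[:-1]))

rmse_mean = np.sqrt(mean_squared_error(y_test, mean_pred))
rmse_last = np.sqrt(mean_squared_error(y_test, last_pred))
print(f"mean-model RMSE: {rmse_mean:.3f}, last-sample RMSE: {rmse_last:.3f}")
```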

Since my team wants to ignore the bad $R^2$, is there another way to show that the paper's RF and ANN models do not produce statistically meaningful results? Perhaps there is a statistical test I could use to show the results are not significant, but I'm not sure where to begin.

As an aside, the problem in this domain is often also formulated as a binary classification task with a given threshold. In that direction, the paper's code attempts to calculate AUROC manually but appears to fail in doing so. The details of the AUROC calculation are not provided in the paper, leaving readers to assume that the standard AUROC method is applied! Rather than using a library, the code calculates AUROC by hand through some sort of bootstrapping process. When I use sklearn's scoring method for AUROC, it appears that all of the $90$ models score around or below $0.5$ (i.e. completely random or even broken!). Perhaps $1$-$3$ models (out of $90$) score around $0.6$ or $0.7$. Again, the team wants to ignore this because the main focus of the paper is apparently the regression task.
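My sklearn check was along these lines; the threshold and values below are placeholders, not the paper's actual data:

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Hypothetical threshold and toy values; the real ones come from the domain
# and from re-running the paper's code.
THRESHOLD = 1.0
y_true_cont = np.array([0.7, 1.2, 0.9, 1.5, 1.1])  # measured values
y_pred_cont = np.array([1.1, 0.8, 1.0, 1.3, 0.9])  # model predictions

# Binarize the ground truth at the threshold and score the continuous
# predictions against it; 0.5 means no better than chance.
y_true_bin = (y_true_cont >= THRESHOLD).astype(int)
print(roc_auc_score(y_true_bin, y_pred_cont))
```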

Edit: Regarding a negative $R^2$ value, the authors calculate $R^2$ using sklearn's r2_score method (https://scikit-learn.org/stable/modules/generated/sklearn.metrics.r2_score.html). According to the documentation: "Best possible score is 1.0 and it can be negative (because the model can be arbitrarily worse)."
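A two-line illustration with made-up numbers: r2_score returns $0$ for a prediction equal to the mean of the observed values and goes negative as soon as the prediction does worse than that.

```python
from sklearn.metrics import r2_score

y_true = [1.0, 2.0, 3.0]
print(r2_score(y_true, [2.0, 2.0, 2.0]))  #  0.0 -> no better than predicting the mean
print(r2_score(y_true, [3.0, 2.0, 1.0]))  # -3.0 -> worse than predicting the mean
```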

Edit 2: This question was previously posted on Data Science (https://datascience.stackexchange.com/questions/112554/showing-machine-learning-results-are-statistically-irrelevant) but moved here after feedback. Before the move, the feedback there suggested a few things, including: an $R^2$ of $0$ or less means that a naive guess would do better (which is why I included the mean and t-1 models); and that it is perhaps wise to be skeptical of such a model. It should also be noted that, as a team, we are looking to improve on the paper's results, leading to a publication. Perhaps, to help demonstrate the insignificance of the results, I could simply show a tally of how many times the mean/last-sample models beat or match the paper's models? (Based on both RMSE and $R^2$, the mean model beat the paper's models in $17/30$ of the tests we are presently reviewing.)
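The tally itself is easy to produce; in the sketch below the arrays are hypothetical placeholders for the per-test-case RMSEs:

```python
import numpy as np

# Hypothetical per-test-case RMSEs; in practice these would come from
# re-running the paper's code and the baselines on the same splits.
rmse_paper = np.array([0.42, 0.38, 0.51, 0.47])
rmse_mean_model = np.array([0.40, 0.39, 0.45, 0.47])

wins = int(np.sum(rmse_mean_model <= rmse_paper))
print(f"mean model matches or beats the paper in {wins}/{len(rmse_paper)} tests")
```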

Comments:
  • How can $R^2$ be negative? – knrumsey Jul 11 '22 at 16:59
  • @knrumsey brings up a good point: traditionally $R^2 \ge 0$, so it is important to mention what definition you are using that allows for $R^2 < 0$. (I assume it is some kind of out-of-sample metric, but not everyone even agrees on what a reasonable out-of-sample $R^2$ is.) – Dave Jul 11 '22 at 17:10
  • For your own knowledge, you might want to know how sklearn.metrics.r2_score works. (For the record, as I have posted on Cross Validated before, I disagree with their definition.) – Dave Jul 11 '22 at 17:23
  • @Dave There's nothing wrong with sklearn's definition of $R^2$. It can be negative for really bad models. – Tim Jul 11 '22 at 18:45
  • @Tim It's not a matter of getting negative values. I dislike the comparison to the out-of-sample mean rather than the in-sample mean. – Dave Jul 11 '22 at 18:52
  • @Dave sklearn uses the in-sample mean. $R^2$ can be negative even if you use the in-sample mean as a comparison, e.g. if your model makes completely random predictions. – Tim Jul 11 '22 at 19:24
  • @Tim sklearn.metrics.r2_score only has the true and predicted values as arguments (at least, arguments without defaults), so I'm not sure how it could use the in-sample mean. – Dave Jul 12 '22 at 00:32
  • Dave is right. sklearn uses the out-of-sample mean, and its definition of $R^2$ makes no sense. – Flounderer Jul 12 '22 at 05:21
  • @Dave It compares the predictions on the training data to the training data versus the training average to that data, or the same with the test data if you score on test data; I'm not sure what you mean. – Tim Jul 12 '22 at 06:30
  • The machine learning approach, if it involves no unsupervised learning, is probably doomed. It takes 70 observations just to be able to adequately estimate a standard deviation of one variable; see RMS Chapter 4. – Frank Harrell Jul 12 '22 at 10:57
  • An $R^2$ can easily be negative when the regression equation does not include a constant term. – JKP Jul 14 '22 at 19:12
  • @Dave Do you have a link to your previous discussion of their definition of $R^2$? – gabagool Jul 15 '22 at 00:48
  • @drj3122 My comment here and the link within get into it. – Dave Jul 15 '22 at 01:12

2 Answers


You answered yourself:

I made two additional models (mean and last sample) which often match or beat the RMSE of the RF and ANN models published in the paper. The mean model just takes the mean of training and uses that in all predictions. The dataset is a timeseries (time-varying, usually 1-2 samples per week), so the last sample model just uses the previous sample's value.

You benchmarked the results against trivial models and the trivial models outperform them. That alone is enough to discard the paper's models. What you did is a pretty standard procedure for validating time-series models.

Negative $R^2$ values are consistent with your benchmarks. In fact, $R^2$ already compares the model to the mean model, because it is defined as

$$ R^2 = 1 - \frac{\sum_i (y_i - \hat y_i)^2}{\sum_i (y_i - \bar y)^2} $$

so the numerator is the sum of squared errors of the model and the denominator is the sum of squared errors of the mean model. Your model must have a smaller squared error than the mean model for $R^2$ to be positive.
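A quick sketch with toy numbers of my own, just to make this concrete: computing the formula by hand gives the same value as sklearn's r2_score, and it goes negative as soon as the model's squared error exceeds that of the mean model.

```python
import numpy as np
from sklearn.metrics import r2_score

y = np.array([2.0, 4.0, 6.0, 8.0])      # toy observed values
y_hat = np.array([6.0, 3.0, 9.0, 4.0])  # toy (bad) model predictions

ss_model = np.sum((y - y_hat) ** 2)      # numerator: model's squared error
ss_mean = np.sum((y - y.mean()) ** 2)    # denominator: mean model's squared error

print(1 - ss_model / ss_mean)            # manual R^2, here -1.1
print(r2_score(y, y_hat))                # the same value from sklearn
```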

Maybe the authors of the published paper didn't run the sanity check? Many crappy results somehow get published.

I'm afraid that if reasonable arguments like comparing the results to benchmarks don't convince your colleagues, a "statistical test" won't either. They are already willing to ignore results they don't like, so it seems rather hopeless.

– Tim

Piggybacking on Tim's answer: you have clearly already trained better models, so just show your colleagues their results.

One note, though: the $R^2$ score can be an unreliable metric, depending on the problem. Consider, for example, a regression model predicting a stock's price on the following day. Any small amount of correlation beyond a random guess (or beyond convergence to the mean) would make you a millionaire!

To summarize, not all low $R^2$ scores are bad.