
I am confused. I know there are a couple of similar questions about $R^2$, but I hope to get some opinions on this particular matter.

I have trained a random forest and other nonparametric regression models and I want to test their performance on unseen data. I want to measure their predictive accuracy.

I am an engineering student who is not particularly good at statistics. I know we must differentiate between measuring goodness of fit (GoF) and predictive accuracy, the difference being that the former is measured on the training data and the latter on test data. But that does not mean we must have different metrics for each. Correct me if I'm wrong, please.

I have read some references arguing that $R^2$ should not be used to measure GoF if our model is not linear or cannot somehow be transformed into a linear model (Kvålseth, 1985; Spiess and Neumeyer, 2010).

Now you may ask, which definition of $R^2$? That's part of the confusion too. Let's take the most common ones:

$$ R_1^2 = 1 - \dfrac{\Sigma (y_{true} - y_{pred})^2}{\Sigma (y_{true} - \bar y_{true})^2}$$

The above version is the one used by the popular scikit-learn package in Python (as `sklearn.metrics.r2_score`).

And $R_2^2$ is the squared correlation coefficient (Pearson's $R$). This one is used in the caret package in R.
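A quick way to see that these two definitions can disagree is to score predictions that are highly correlated with the target but systematically biased. This is a minimal sketch with toy numbers (assumed for illustration); the first formula is what `sklearn.metrics.r2_score` computes, and `np.corrcoef` supplies Pearson's $R$:

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_pred = np.array([1.5, 2.5, 3.5, 6.0])  # correlated with y_true, but biased

# R_1^2 = 1 - SS_res / SS_tot (the scikit-learn r2_score definition)
r1_sq = 1.0 - np.sum((y_true - y_pred) ** 2) / np.sum((y_true - y_true.mean()) ** 2)

# R_2^2 = squared Pearson correlation (the caret definition)
r2_sq = np.corrcoef(y_true, y_pred)[0, 1] ** 2

print(r1_sq, r2_sq)  # 0.05 vs ~0.94
```

Here $R_2^2$ is near one because it is invariant to bias and scale of the predictions, while $R_1^2$ is near zero because it penalizes the actual squared errors.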

The interpretation for both of them: the proportion of the total variance of $y_{true}$ explained by the fitted model.

Two things I gather from this:

  1. It is apparently only a measure of GoF.
  2. Since it is a proportion, it is meaningless for it to be negative and it MUST lie between zero and one.

I want your opinion on this: in my field (hydrology), researchers use the Nash–Sutcliffe efficiency (NSE), which is calculated exactly as $R_1^2$, to measure the predictive accuracy or power of hydrological models that are clearly not linear. Their rationale is that the model should do better than a benchmark, the benchmark being $\bar y_{true}$; negative values of NSE therefore mean that the model is doing worse than the mean of the target. I have a feeling that this is fundamentally wrong: the benchmark estimator is vague, and how can we have it on unseen data to begin with? Also, since NSE is basically $R_1^2$, we cannot use it as a measure of predictive accuracy.

Now my questions:

  1. Should/can I use $R_1^2$ to measure the accuracy of my random forest's predictions?
  2. Can I use $R_2^2$ for the above-mentioned purpose?
  3. Besides metrics like MAE and RMSE, what other options are there to quantify the performance of nonparametric models on test data, in terms of accuracy or association?

Here is a subset of my test data prediction and observations:

\begin{array}{|c|c|c|}
\hline
{} & y\_true & y\_preds \\ \hline
0 & 3.745821 & 4.894624 \\ \hline
1 & 3.940449 & 5.743571 \\ \hline
2 & 2.849447 & 4.726890 \\ \hline
3 & 1.653091 & 2.659571 \\ \hline
4 & 2.934447 & 4.244686 \\ \hline
5 & 3.346146 & 5.269689 \\ \hline
6 & 2.450010 & 4.651610 \\ \hline
7 & 3.393356 & 5.122578 \\ \hline
8 & 0.791639 & 1.656736 \\ \hline
9 & 0.893791 & 1.935156 \\ \hline
10 & 0.129959 & 3.976739 \\ \hline
11 & 2.043000 & 4.072408 \\ \hline
12 & 4.298383 & 4.357470 \\ \hline
13 & 3.115428 & 4.432231 \\ \hline
14 & 4.325494 & 4.599493 \\ \hline
\end{array}

(The values are daily evapotranspiration volumes in mm.)

For this subset and my random forest:

$R_1^2 = -0.87$ and $R_2^2 = 0.55$.
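These two numbers can be checked directly from the subset above. A minimal sketch in plain NumPy (again, the first formula is what `sklearn.metrics.r2_score` computes, and `np.corrcoef` gives Pearson's $R$):

```python
import numpy as np

# Test-set subset from the table above
y_true = np.array([3.745821, 3.940449, 2.849447, 1.653091, 2.934447,
                   3.346146, 2.450010, 3.393356, 0.791639, 0.893791,
                   0.129959, 2.043000, 4.298383, 3.115428, 4.325494])
y_pred = np.array([4.894624, 5.743571, 4.726890, 2.659571, 4.244686,
                   5.269689, 4.651610, 5.122578, 1.656736, 1.935156,
                   3.976739, 4.072408, 4.357470, 4.432231, 4.599493])

# R_1^2 = 1 - SS_res / SS_tot (scikit-learn's r2_score; also the NSE formula)
ss_res = np.sum((y_true - y_pred) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r1_sq = 1.0 - ss_res / ss_tot  # negative: SS_res exceeds SS_tot

# R_2^2 = squared Pearson correlation (caret's definition)
r2_sq = np.corrcoef(y_true, y_pred)[0, 1] ** 2

print(round(r1_sq, 2), round(r2_sq, 2))  # -0.87 0.55
```

The negative $R_1^2$ arises because the residual sum of squares is larger than the total sum of squares, i.e. the predictions have a larger squared error than the constant benchmark $\bar y_{true}$.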

  • Does this answer your question? Using $R^2$ for RF – Dave Jan 04 '21 at 20:46
  • @Dave thank you for pointing out the post. It is not as clear as I need it to be in terms of GoF vs. predictive accuracy. Also, I have some other concerns about NSE, which it would be great to discuss here. As you see, I get a negative R squared for my test set and I fail to interpret it in terms of "explaining variance/deviance". – Alireza Amani Jan 04 '21 at 20:51
  • The trouble is that there is no magic number that gives you an $\text{A}$ grade. In some tasks, a value of $0.9$ might be awful, while, in some tasks, a value of $0.6$ might make you a trillionaire. So maybe your scaled $R^2$-type of measurement means that you have performance $0.8$. Out of context, that lacks meaning. – Dave Jan 04 '21 at 20:53
  • @Dave, I understand. I am OK with the value I get from measuring MAE on this subset, and I basically base my choices and comparisons on MAE divided by the average target value, $\bar y_{true}$. That being said, it's always good to have multiple measures to quantify different aspects. My problem is: if $R_1^2$ and NSE should not be used for measuring predictive accuracy, why are they used so frequently? And as I pointed out, I am not comfortable with how hydrologists interpret negative NSE, and NSE in general. – Alireza Amani Jan 04 '21 at 20:57
  • 1) Mixing an MAE metric with the pooled mean (instead of the median) seems like a mistake. 2) $R^2_1$ is not an invalid metric for nonlinear regressions. However, it is equivalent to MSE, and (in the nonlinear case) it lacks the interpretation as a proportion of variance explained, due to the lack of orthogonality. – Dave Jan 04 '21 at 21:10
  • On the use of MAE divided by mean: Kolassa & Schütz 2007, Foresight – Alireza Amani Jan 04 '21 at 23:57
  • Interesting! As it happens, Kolassa is a major contributor here who has answered a number of questions of mine! – Dave Jan 05 '21 at 00:00