Modeling Expensive Functions Using ML (Q1), Is $r^2$ Appropriate?

Question

I may end up asking several follow-up questions in different posts, hence the broad view of my problem.

Context and Background

I have a function which is very expensive to calculate. Imagine simulating a power plant, a complicated circuit board, or a mechanical engine. A single simulation can take hours or even days of wall-clock time when run on a supercomputer. Abstractly, it is just a mathematical function which has some inputs (dozens or even a hundred of them, such as the geometry of the device) and a couple of outputs (such as power output or efficiency). My ultimate goal is to optimize a given device design within the given constraints, such as maximizing power output within the geometrical constraints provided to me. Local optima are fine. It doesn't have to be a global optimum, but if it is, great.

My Thought Process

The function is highly non-linear and extremely expensive. We have no hope of knowing anything about the global structure of this function. So I thought to myself: let me run a few (10-100) simulations beforehand, train an ML model such as a neural network (a surrogate model?), use this NN to quickly optimize the design and get close to the true optimum, then go back to the expensive function and home in on the true optimum in a couple of expensive evaluations.

I decided to try neural networks first. I can move onto other methods if NNs fail. I have played with them for some time and I have a few questions to which I cannot find a clear answer.

My reflex was to simply use the coefficient of determination $r^2$ as a way to measure how well an NN approximates my expensive function. Specifically, $$r^2=1-\frac{SS_{res}}{SS_{total}}.$$

But several books and articles I have read recently (including lots of posts here) suggest that $r^2$ is only appropriate for the linear least squares method. Is this true? Is $r^2$ inappropriate even for nonlinear least squares?

Is there a difference between $R^2$ and $r^2$? What does $R^2$ even denote? I had no idea that there were different definitions. Can someone please briefly explain what the different definitions are, when they should be used, and their pros and cons?

Dave · Answer 1 · 2023-04-20T10:54:35.517

Some people might use $r^2$ to denote the squared Pearson correlation between the predictions and true values, $r^2 = \left(\text{corr}\left(\hat y, y\right)\right)^2$, while using $R^2$ to mean the formula you gave.

There are serious problems with just squaring the Pearson correlation between the predictions and true values. I go through some of the math here and show some images here. In that regard, $r^2$ is problematic: correlation ignores any shift or rescaling of the predictions, so it can miss that your predictions are terrible. A reason people use it is that $r^2$ and $R^2$ coincide for OLS linear regression with an intercept, which is an extremely common regression technique that almost anyone who uses statistics has learned.
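To make that concrete, here is a minimal sketch with made-up numbers (the data and the function names `squared_pearson` and `r2_from_residuals` are my own, purely for illustration). The predictions are a perfect linear function of the true values, so the squared correlation is 1, yet every prediction is off by more than 10 units and the residual-based $R^2$ is strongly negative.

```python
def squared_pearson(y_true, y_pred):
    """Squared Pearson correlation between true values and predictions."""
    n = len(y_true)
    mean_t = sum(y_true) / n
    mean_p = sum(y_pred) / n
    cov = sum((t - mean_t) * (p - mean_p) for t, p in zip(y_true, y_pred))
    var_t = sum((t - mean_t) ** 2 for t in y_true)
    var_p = sum((p - mean_p) ** 2 for p in y_pred)
    return cov ** 2 / (var_t * var_p)


def r2_from_residuals(y_true, y_pred):
    """R^2 = 1 - SS_res / SS_total, with SS_total taken about the mean of y_true."""
    mean_t = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - mean_t) ** 2 for t in y_true)
    return 1 - ss_res / ss_total


# Illustrative data: predictions are perfectly correlated with the truth
# but badly biased and wrongly scaled.
y_true = [1.0, 2.0, 3.0, 4.0, 5.0]
y_pred = [2 * t + 10 for t in y_true]

r_sq = squared_pearson(y_true, y_pred)    # 1.0: correlation sees a perfect line
R_sq = r2_from_residuals(y_true, y_pred)  # -84.5: far worse than predicting the mean

print(f"squared correlation r^2 = {r_sq:.3f}")
print(f"residual-based R^2      = {R_sq:.3f}")
```

A negative $R^2$ here just says the model has a larger sum of squared residuals than always predicting the mean of the observed values.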

If you take $R^2 = 1 - \dfrac{SS_{res}}{SS_{total}}$ as you have, then $R^2$ is just a transformation of the sum of squared residuals, and the sum of squared residuals is an extremely reasonable way to assess a regression model.

A common complaint about $R^2$ is that it can be driven to (or close to) a perfect $R^2=1$ by overfitting to the training data. This is valid, but do the algebra: that corresponds to $SS_{res}=0$ (or close to zero if $R^2=1$ is impossible). Thus, such a complaint is also a complaint about $SS_{res}$ (ditto for $MSE=SS_{res}/N$ and $RMSE = \sqrt{MSE}$).

A way to keep your modeling honest is to check performance on some holdout data. There is disagreement about what constitutes an out-of-sample $R^2$ calculation, though if you pick one that is a transformation of the out-of-sample sum of squared residuals, such as the $ R^2_{\text{out-of-sample, Dave}} $ that I discuss in the link, you are evaluating that sum of squared residuals, perhaps in a way that gives some context to the value as a comparison to the performance of a baseline model.
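As a rough sketch of why the holdout check matters, consider a model that simply memorizes its training data. Everything below is an assumption made for illustration: `toy_function` stands in for the expensive simulation, the 1-nearest-neighbor predictor stands in for an overfit model, and the baseline in the denominator is the mean of the training targets, which is one possible convention and not necessarily the definition discussed in the link.

```python
import math


def toy_function(x):
    # Hypothetical stand-in for the expensive simulation.
    return math.sin(x) + 0.3 * math.cos(7 * x)


def nearest_neighbor_predict(x_train, y_train, x):
    # 1-NN memorizes the training set, so it reproduces training targets exactly.
    i = min(range(len(x_train)), key=lambda j: abs(x_train[j] - x))
    return y_train[i]


def r2(y_true, y_pred, baseline_mean):
    # 1 - SS_res / SS_total, with SS_total measured against a baseline
    # that always predicts baseline_mean (an assumed convention).
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_total = sum((t - baseline_mean) ** 2 for t in y_true)
    return 1 - ss_res / ss_total


x_train = [0.5 * i for i in range(12)]        # 0.0, 0.5, ..., 5.5
x_test = [0.5 * i + 0.25 for i in range(11)]  # points between the training inputs
y_train = [toy_function(x) for x in x_train]
y_test = [toy_function(x) for x in x_test]

train_mean = sum(y_train) / len(y_train)

pred_train = [nearest_neighbor_predict(x_train, y_train, x) for x in x_train]
pred_test = [nearest_neighbor_predict(x_train, y_train, x) for x in x_test]

r2_train = r2(y_train, pred_train, train_mean)  # exactly 1: SS_res is 0 on memorized data
r2_test = r2(y_test, pred_test, train_mean)     # lower: the honest number

print(f"in-sample R^2     = {r2_train:.3f}")
print(f"out-of-sample R^2 = {r2_test:.3f}")
```

The in-sample value is perfect by construction, so only the holdout value says anything about how the surrogate behaves at designs it has not seen.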

Evaluating the out-of-sample sum of squared residuals, or some function of it like $R^2$, $MSE=\dfrac{SS_{res}}{N}$ (for $N$ predictions), or $RMSE = \sqrt{MSE}$, makes total sense, whether the model is linear or not. Other measures of performance might make more sense (e.g., mean absolute error, $MAE = \dfrac{1}{N}\overset{N}{\underset{i=1}{\sum}}\left\vert y_i - \hat y_i\right\vert$), but that is a separate conversation. If what you find interesting is a measure of square loss, $SS_{res}$, $MSE$, $RMSE$, and $R^2 = 1 - \frac{SS_{res}}{SS_{total}}$ are transformations of each other (given $N$ and $SS_{total}$, which are fixed by the data) and, in some sense, convey the same information.
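The "transformations of each other" point can be checked with a few lines of arithmetic. This is a small sketch on made-up numbers (the data are illustrative only): once $N$ and $SS_{total}$ are fixed by the data, any one of $SS_{res}$, $MSE$, $RMSE$, and $R^2$ determines the others, while $MAE$ is a genuinely different measure.

```python
import math

# Illustrative true values and predictions.
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 11.0]
n = len(y_true)

mean_true = sum(y_true) / n
ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))  # 1 + 0 + 1 + 4 = 6
ss_total = sum((t - mean_true) ** 2 for t in y_true)        # 9 + 1 + 1 + 9 = 20

mse = ss_res / n                # 1.5
rmse = math.sqrt(mse)
r2 = 1 - ss_res / ss_total      # 1 - 6/20 = 0.7

# Going back the other way: each quantity recovers SS_res.
ss_res_from_mse = mse * n
ss_res_from_rmse = rmse ** 2 * n
ss_res_from_r2 = (1 - r2) * ss_total

# MAE uses absolute rather than squared residuals, so it is not a
# transformation of SS_res.
mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / n   # (1 + 0 + 1 + 2) / 4 = 1.0

print(ss_res, mse, rmse, r2, mae)
```

Ranking several candidate models on the same holdout set by $SS_{res}$, $MSE$, $RMSE$, or $R^2$ therefore gives the same ordering; only the scale and interpretation of the number change.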
