I may end up asking several follow-up questions in different posts, hence the broad view of my problem.
Context and Background
I have a function which is very expensive to calculate. Imagine simulating a power plant, a complicated circuit board, or a mechanical engine. A single simulation can take hours or even days of wall-clock time when run on a supercomputer. Abstractly, it is just a mathematical function with some inputs (dozens or even a hundred, such as the geometry of the device) and a couple of outputs (such as power output or efficiency). My ultimate goal is to optimize a given device design within the given constraints, such as maximizing power output within the geometrical constraints provided to me. Local optima are fine. It doesn't have to be a global optimum, but if it is, great.
My Thought Process
The function is highly non-linear and extremely expensive. We have no hope of knowing anything about the global structure of this function. So I thought to myself: let me run a few (10-100) simulations beforehand, train an ML model such as a neural network (a surrogate model?), use this NN to quickly optimize the design and get close to the true optimum, then go back to the expensive function and home in on the true optimum with a couple of expensive evaluations.
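To make the workflow above concrete, here is a minimal sketch of it. Everything in it is an assumption for illustration: `expensive_function` is a cheap hypothetical stand-in for the real simulation, the surrogate is a simple quadratic least-squares fit rather than a neural network, and the surrogate is optimized by plain random search over the unit box.

```python
import numpy as np

def expensive_function(x):
    # Hypothetical cheap stand-in for the real simulation (assumption).
    # Its maximum is at x = (0.3, ..., 0.3).
    return -np.sum((x - 0.3) ** 2, axis=-1)

def quadratic_features(X):
    # Columns: 1, x_i, x_i^2 (no cross terms, to keep the sketch short).
    return np.hstack([np.ones((X.shape[0], 1)), X, X ** 2])

rng = np.random.default_rng(0)
dim = 3

# Step 1: a small budget of "expensive" runs (50 here, within the 10-100 range).
X_train = rng.uniform(0.0, 1.0, size=(50, dim))
y_train = expensive_function(X_train)

# Step 2: fit a cheap surrogate (quadratic least squares, swapped in for the NN).
coef, *_ = np.linalg.lstsq(quadratic_features(X_train), y_train, rcond=None)

def surrogate(X):
    return quadratic_features(X) @ coef

# Step 3: optimize the surrogate cheaply (random search inside the box constraints).
candidates = rng.uniform(0.0, 1.0, size=(20000, dim))
best = candidates[np.argmax(surrogate(candidates))]

# Step 4: spend one more expensive evaluation to check the proposed design.
true_value_at_best = expensive_function(best)
```

In the real setting, step 4 would be followed by a few more expensive evaluations near `best` to refine the design.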
I decided to try neural networks first. I can move on to other methods if NNs fail. I have played with them for some time and I have a few questions to which I cannot find a clear answer.
My reflex was to simply use the coefficient of determination $r^2$ as a way to measure how well an NN approximates my expensive function. Specifically, $$r^2=1-\frac{SS_{res}}{SS_{total}}.$$
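For reference, here is a small sketch of how I compute this quantity, assuming the usual definitions $SS_{res}=\sum_i (y_i-\hat{y}_i)^2$ and $SS_{total}=\sum_i (y_i-\bar{y})^2$, where $\hat{y}_i$ are the model predictions and $\bar{y}$ is the mean of the observed values. The function names are my own. For comparison, it also computes the squared Pearson correlation between observations and predictions, which some texts write as $r^2$; the toy data are made up purely to show that the two numbers need not agree.

```python
import numpy as np

def coefficient_of_determination(y_true, y_pred):
    # 1 - SS_res / SS_total, exactly as in the formula above.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_total = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_total

def squared_pearson(y_true, y_pred):
    # Square of the Pearson correlation between observations and predictions.
    return np.corrcoef(y_true, y_pred)[0, 1] ** 2

# Made-up example: predictions that are perfectly correlated but offset by 2.
y_true = np.array([1.0, 2.0, 3.0, 4.0])
y_biased = y_true + 2.0

r2_formula = coefficient_of_determination(y_true, y_biased)
r2_corr = squared_pearson(y_true, y_biased)
```

On this example the formula gives a negative value (the offset predictions are worse than predicting the mean), while the squared correlation equals 1, so the two are clearly different quantities in general.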
But several books and articles I have read recently (including lots of posts here) suggest that $r^2$ is only appropriate for linear least squares. Is this true? Is $r^2$ inappropriate even for nonlinear least squares?
Is there a difference between $R^2$ and $r^2$? What does $R^2$ even denote? I had no idea that there were different definitions. Can someone please briefly explain what the different definitions are, when they should be used, and their pros and cons?