I have a function on the unit square that's expensive to evaluate,
and want to estimate where it reaches its minimum in at most $N$ evaluations, $N \approx 10$ say.
Assuming a model of the form paraboloid + noise,
$$ f(x;\ x_{\min}, a, c, \sigma) \approx a\,(x - x_{\min})^2 + c + \mathcal{N}(0, \sigma^2), $$
what's an algorithm to generate points $x_0 \dots x_{N-1}$ at which to sample f(), for a best estimate of $x_{\min}$? (My real problem is 2d, but it may be helpful to do 1d first, so $x,\ x_{\min},\ a,\ c,\ \sigma$ are 1d or 2d as appropriate.)
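To make the fit step concrete: in 1d, $a(x - x_{\min})^2 + c = a x^2 - 2 a x_{\min} x + (a x_{\min}^2 + c)$ is just a quadratic in $x$, so ordinary least squares gives the coefficients and the vertex $-b/(2a)$ is the $x_{\min}$ estimate. A minimal sketch in Python/numpy (the function name and the residual-based $\sigma$ estimate are my choices, not a standard recipe):

```python
import numpy as np

def fit_paraboloid_1d(x, y):
    """Least-squares fit of y ~ a*(x - xmin)^2 + c + noise,
    done as an ordinary quadratic fit y ~ p0*x^2 + p1*x + p2."""
    p = np.polyfit(x, y, 2)            # p[0]*x^2 + p[1]*x + p[2]
    a = p[0]
    xmin = -p[1] / (2.0 * a)           # vertex of the fitted parabola
    c = p[2] - a * xmin**2             # constant term of the original model
    resid = y - np.polyval(p, x)
    sigma = resid.std(ddof=3)          # 3 fitted parameters; needs > 3 points
    return xmin, a, c, sigma
```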
After sampling f() at the 4 corners and the middle, one can alternate:
- fit a paraboloid to the data so far $\rightarrow x_{\min}$
- generate the next sample point $x_i$ ... how? (one heuristic sketch below)
Heuristics come to mind, e.g. "near $x_{\min}$ but not too near", but this is surely a well-known problem? (I imagine the cases of $\sigma$ known and $\sigma$ unknown will be different.)
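For concreteness, here is one (heuristic, surely suboptimal) 2d version of that loop: refit after every evaluation and propose the next point near the current $x_{\min}$ estimate, with a jitter that shrinks as the budget is spent. I read $a(x - x_{\min})^2$ as $a\,\lVert x - x_{\min}\rVert^2$ with scalar $a$; with separate curvatures per axis, the 4-corners-plus-middle design is actually rank-deficient ($x_1^2 - x_1 - x_2^2 + x_2$ vanishes at all 5 points). The jitter schedule and the toy f() are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    """Stand-in for the expensive function: true minimum at (0.3, 0.6)."""
    return 3.0 * np.sum((np.asarray(x) - [0.3, 0.6])**2) + 1.0 + rng.normal(0.0, 0.1)

def fit_xmin(X, y):
    """Fit y ~ a*||x||^2 + b1*x1 + b2*x2 + c, i.e. a*||x - xmin||^2 + const,
    by least squares; the vertex is xmin = -b/(2a)."""
    A = np.column_stack([(X**2).sum(axis=1), X[:, 0], X[:, 1], np.ones(len(X))])
    a, b1, b2, _ = np.linalg.lstsq(A, y, rcond=None)[0]
    return np.array([-b1, -b2]) / (2.0 * a)

# initial design: 4 corners + middle, as above
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.], [0.5, 0.5]])
y = np.array([f(x) for x in X])

N = 10
for i in range(len(X), N):
    xmin = fit_xmin(X, y)
    # heuristic: near the current estimate but not too near,
    # with a jitter that shrinks as the budget is spent
    step = 0.5 * (N - i) / N
    x_next = np.clip(xmin + rng.uniform(-step, step, size=2), 0.0, 1.0)
    X = np.vstack([X, x_next])
    y = np.append(y, f(x_next))

print("estimated xmin:", fit_xmin(X, y))
```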
Added 11 Feb: In grid search, one typically evaluates f() 5 times at each point (5-fold cross-validation), on a, say, $10 \times 10$ grid, for 500 evaluations in all. The 5 values would give an estimate of $\sigma$ at each point, but then what: how would a statistician combine the $5 \times 100$ values?
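One answer I can imagine (under the paraboloid model): collapse each grid point's 5 values to a mean and a standard error, then fit the surface to the 100 means by weighted least squares with weights $1/\mathrm{SE}^2$. A sketch under that assumption, reusing the scalar-$a$ parameterization from above:

```python
import numpy as np

def combine_and_fit(X, Y):
    """X: (100, 2) grid points; Y: (100, 5) repeated evaluations.
    Collapse each point's 5 values to a mean and standard error, then
    fit y ~ a*||x||^2 + b1*x1 + b2*x2 + c by weighted least squares."""
    m = Y.mean(axis=1)
    se = Y.std(axis=1, ddof=1) / np.sqrt(Y.shape[1])  # per-point SE of the mean
    sw = 1.0 / se                                     # sqrt of weight 1/SE^2; assumes se > 0
    A = np.column_stack([(X**2).sum(axis=1), X[:, 0], X[:, 1], np.ones(len(X))])
    a, b1, b2, _ = np.linalg.lstsq(A * sw[:, None], m * sw, rcond=None)[0]
    return np.array([-b1, -b2]) / (2.0 * a)           # estimated xmin
```

(If the 5 values are cross-validation folds rather than independent replicates, the SEs are optimistic, since folds share training data, but the weighting still downweights the noisiest grid points.)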
This question on SVM grid search shows 12 different bumpy not-so-paraboloids, but does not show the spread at each point. Also, try searching for grid+search on stats.stackexchange.