In the context of Surrogate Modelling and Bayesian Optimization, Acquisition Functions (https://tune.tidymodels.org/articles/acquisition_functions.html) are often used as a "compass" that suggests which point to evaluate the objective function at next.
In short, the objective function you are trying to optimize is modelled with some (simpler) surrogate model (e.g. the surface of the objective function can be modelled with a Gaussian Process), and a separate Acquisition Function "suggests" how to navigate the surface of the Gaussian Process at each iteration. This "feedback loop" is repeated until some stopping condition is met (e.g. convergence, or the evaluation budget is exhausted). Note: in practice, the Acquisition Function itself must be optimized at each iteration, typically with a gradient-based algorithm (e.g. BFGS), but this inner sub-optimization problem is said to be much easier than the main one, since the Acquisition Function is cheap to evaluate (unlike the true objective).
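To make the loop above concrete, here is a minimal sketch in Python (my own illustration, not from the linked tidymodels article). It assumes a toy 1-D objective, a Gaussian Process surrogate from scikit-learn, Expected Improvement as the Acquisition Function, and L-BFGS-B with random restarts for the inner sub-optimization; all of these specific choices are placeholders.

```python
# A minimal sketch of the loop above, assuming: a toy 1-D objective, a GP
# surrogate from scikit-learn, Expected Improvement (EI) as the acquisition
# function, and L-BFGS-B with random restarts for the inner sub-optimization.
# All specific choices (objective, kernel, budget) are illustrative only.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def objective(x):
    # The "expensive" black-box function we pretend we can only sample pointwise.
    return np.sin(3 * x) + 0.1 * x ** 2

bounds = np.array([[-3.0, 3.0]])
rng = np.random.default_rng(0)

# Initial design: a handful of random evaluations of the true objective.
X = rng.uniform(bounds[0, 0], bounds[0, 1], size=(4, 1))
y = objective(X).ravel()

gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)

def expected_improvement(x, gp, y_best):
    # EI for minimization: expected improvement over the incumbent y_best at x.
    mu, sigma = gp.predict(x.reshape(1, -1), return_std=True)
    sigma = np.maximum(sigma, 1e-12)
    z = (y_best - mu) / sigma
    ei = (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)
    return float(ei[0])

for _ in range(10):                       # budget: 10 more objective evaluations
    gp.fit(X, y)                          # refit the surrogate to all data so far
    y_best = y.min()

    # Inner sub-optimization: maximize EI with a gradient-based optimizer,
    # restarting from several random points to avoid local optima of EI.
    best_x, best_ei = None, -np.inf
    for x0 in rng.uniform(bounds[:, 0], bounds[:, 1], size=(10, 1)):
        res = minimize(lambda x: -expected_improvement(x, gp, y_best),
                       x0, bounds=bounds, method="L-BFGS-B")
        if -res.fun > best_ei:
            best_x, best_ei = res.x, -res.fun

    # Evaluate the expensive objective only at the suggested point, then repeat.
    X = np.vstack([X, best_x])
    y = np.append(y, objective(best_x)[0])

print("Best point found:", X[y.argmin()], "with value", y.min())
```

Note that the true objective is evaluated only once per iteration; all the other function evaluations are absorbed by the GP and EI, which are cheap to compute.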
I am trying to understand why an Acquisition Function is necessary in the above procedure. I have heard the following argument for why it is required: in many applications where we want to use Bayesian Optimization, we only have realizations of some partially observable objective function. This means that the Gaussian Process is not very "informative", so trying to optimize the Gaussian Process directly will not be very informative either. This is why an Acquisition Function should be used, and somehow its use circumvents this "un-informativeness" problem.
However, I do not fully understand this argument. If the Gaussian Process itself is uninformative, how can the use of an Acquisition Function remedy this problem?
My Question: Why is an Acquisition Function required in Bayesian Optimization? After all, why can't we directly optimize the Gaussian Process, without taking the advice of the Acquisition Function into consideration?
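To make the contrast in my question concrete (the notation below is mine, not from the linked article): by "directly optimizing the Gaussian Process" I mean picking the next point from the posterior mean alone,

$$x_{n+1} = \arg\min_{x} \, \mu_n(x),$$

whereas an Acquisition Function such as Expected Improvement also uses the posterior uncertainty $\sigma_n(x)$:

$$x_{n+1} = \arg\max_{x} \; \mathbb{E}\big[\max\big(f_n^{*} - f(x),\, 0\big)\big], \qquad f_n^{*} = \min_{i \le n} y_i,$$

where the expectation is taken under the Gaussian Process posterior after $n$ evaluations.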
Thanks!