
I am trying to figure out what a Bayesian approach to machine learning looks like.

I start from a model that, for any given feature vector and target, calculates a probability density:

$ P (y, x_1, x_2, \dots, x_n, c_1, c_2, \dots, c_k) $

where $y$ is the observed target, $x_i$ are $n$ features and $c_j$ are $k$ model parameters.
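
For instance (just an illustration, not necessarily the model I have in mind), a linear model with Gaussian noise, parameters $c_1, \dots, c_n$ (so $k = n$ here) and a fixed noise scale $\sigma$ would give

$ P(y \mid x_1, \dots, x_n, c_1, \dots, c_n) = \frac{1}{\sqrt{2\pi}\,\sigma} \exp\left( -\frac{\left(y - \sum_{i=1}^{n} c_i x_i\right)^2}{2\sigma^2} \right) $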

I want to find good values of the model parameters.


For a given data set (pairs of feature vectors and corresponding targets) and for given values of the model parameters, I can calculate the probability (density) of the observed targets. So, in fact, I can get the probability of the data given the model. What I need in the end is the probability of the model given the data. I can get this probability by Bayes' formula if I know the prior distribution over the models:

$ \pi (c_1, c_2, \dots, c_k) $
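
Written out, the step I have in mind is Bayes' formula applied to the parameters, with "data" standing for the observed pairs of feature vectors and targets:

$ P(c_1, c_2, \dots, c_k \mid \text{data}) = \frac{P(\text{data} \mid c_1, c_2, \dots, c_k)\, \pi(c_1, c_2, \dots, c_k)}{P(\text{data})} $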


To keep it simple, my prior is uniform. I say that all parameters lie in the range from 0 to 1 and, within this restriction, all values of the parameter vector are equally probable. For example, (0.2, 0.1, 0.3) is as probable as (0.7, 0.5, 0.4) or any other such triplet.

In this case finding the most probable model given the data is equivalent to finding the model for which the probability of the data is the highest.
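
This follows because, with a constant prior, Bayes' formula reduces to

$ P(c_1, \dots, c_k \mid \text{data}) \propto P(\text{data} \mid c_1, \dots, c_k)\, \pi(c_1, \dots, c_k) = \text{const} \cdot P(\text{data} \mid c_1, \dots, c_k), $

so the same parameter values maximize both sides.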


However, I believe that finding the most probable model given the data is the wrong thing to do. I can imagine that in the parameter space there is a point at which the probability (density) of the data given the model (or of the model given the data) is very large, but only over a very narrow range.

So we could have a situation where, for some model parameters, the probability of the data is very high, yet it is very improbable that the model parameters actually lie in such a narrow range.
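
As a toy numerical illustration of this worry (the density below is completely made up, not my actual model), here is a one-dimensional "posterior" that mixes a tall, very narrow peak with a lower, broad bump; the density maximum sits on the narrow peak, while most of the probability mass lies under the broad bump:

```python
# Made-up one-dimensional "posterior" over a single parameter c in [0, 1]:
# a tall, very narrow peak plus a lower, broad bump.
import numpy as np
from scipy import stats

def posterior(c):
    narrow = stats.norm(loc=0.9, scale=0.005)  # tall but very narrow peak
    broad = stats.norm(loc=0.3, scale=0.1)     # lower but wide bump
    return 0.1 * narrow.pdf(c) + 0.9 * broad.pdf(c)

grid = np.linspace(0.0, 1.0, 100_001)
dc = grid[1] - grid[0]
dens = posterior(grid)

print("point of maximum density:", grid[np.argmax(dens)])  # ~0.9

# probability mass within +/- 0.05 of each candidate point
for c in (0.9, 0.3):
    mask = np.abs(grid - c) < 0.05
    print(f"mass near {c}: {np.sum(dens[mask]) * dc:.3f}")  # ~0.10 vs ~0.34
```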

So, what would be a Bayesian approach to this problem?

  • Regularization/penalization? – user2974951 Jun 08 '22 at 11:41
  • I hope that there is a model-driven solution rather than some ad hoc work-around. – Roman Jun 08 '22 at 11:44
  • What you are describing is MAP. A Bayesian approach is essentially averaging (appropriately) over the models, which in ML is called an ensemble approach. – seanv507 Jun 08 '22 at 11:57
  • @seanv507 Doesn't MAP just maximize the a posteriori probability? If that is the case, it does not resolve the problem that I mention (the global maximum can be very narrow). – Roman Jun 08 '22 at 12:07
  • I am not disagreeing; I am just saying that the thing you are criticising is called MAP, whereas Bayesians average. – seanv507 Jun 08 '22 at 12:36
  • MAP is not a typical Bayesian solution as it is not the result of a loss optimisation and depends on the parameterisation of the model. – Xi'an Jun 10 '22 at 08:15

1 Answer


The disadvantage of the maximum a posteriori (MAP) estimator that you mention can be illustrated with a bimodal distribution like the following:

[figure: bimodal distribution]

The maximum is at the point 2, but there is not a lot of probability mass in its neighborhood.

An alternative is to optimize a cost function. For instance, with a squared-error loss you end up with the posterior mean.
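
A small numerical sketch of the same point (the bimodal density below is invented, not the exact curve in the figure): the MAP estimate lands on the narrow spike, while the posterior mean, which minimizes the expected squared-error loss, lands where the bulk of the mass is.

```python
# Invented bimodal "posterior": a narrow spike near 2 and a broad component near -1.
import numpy as np
from scipy import stats

def posterior(x):
    return 0.15 * stats.norm(2.0, 0.05).pdf(x) + 0.85 * stats.norm(-1.0, 0.7).pdf(x)

x = np.linspace(-5.0, 5.0, 200_001)
dx = x[1] - x[0]
p = posterior(x) / np.sum(posterior(x) * dx)  # normalise numerically

map_estimate = x[np.argmax(p)]      # ~ 2.0, on the narrow spike
mean_estimate = np.sum(x * p * dx)  # ~ -0.55, where most of the mass is

def expected_squared_loss(a):
    return np.sum((a - x) ** 2 * p * dx)

print(map_estimate, mean_estimate)
# the posterior mean has the smaller expected squared-error loss
print(expected_squared_loss(map_estimate), expected_squared_loss(mean_estimate))
```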


Edit: Xi'an mentioned in the comments

MAP is not a typical Bayesian solution as it is not the result of a loss optimisation and depends on the parameterisation of the model.

That is another disadvantage besides the one that you mentioned.

A probability distribution changes depending on how we express the variable: the distribution of $X$ may have a different maximum than the distribution of $X^2$. This means that changing the way you express a variable, e.g. whether you use volume or length, can have an influence on the estimate.
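
To make that concrete: if $X > 0$ has density $f_X$ and we reparameterise to $Y = X^2$, the change-of-variables formula gives

$ f_Y(y) = f_X\left(\sqrt{y}\right) \cdot \frac{1}{2\sqrt{y}} $

and the extra Jacobian factor $\frac{1}{2\sqrt{y}}$ generally shifts the location of the maximum, so the MAP estimate of $Y$ is not simply the square of the MAP estimate of $X$.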