
Say we are fitting a parametric model $y(x, \theta)$ to some data (e.g. logistic regression). Given a prior distribution over the model parameters $\theta$ and the observed data, we arrive at a posterior distribution for the parameters as well as a predictive posterior distribution at any new point.

Now suppose that at a given input point, we wish to calculate the mean model prediction. We have two options:

  1. We can sample the predictive posterior distribution at the point and find its mean
  2. We can take the model value defined by the mean posterior values of the parameters

Are these two options guaranteed to give the same results? I suspect not, and if that's the case, which is more correct to use?

For example, in a logistic regression scenario, we have a model for $p$: $$p = (1 + \exp(a+bx))^{-1}$$

Once we have a posterior and some sample point $x_0$, we could either find the mean of the predictive posterior of $p$ at $x_0$ (option 1), or simply use the logistic model evaluated at the mean parameters $\bar a$ and $\bar b$ (option 2). Since the mean isn't invariant to re-parameterization, I can imagine these two numbers being different, though intuitively both seem to be legitimate approaches to answering the same question.
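To make the comparison concrete, here is a minimal sketch of the two computations; the posterior draws `a_samples` and `b_samples` below are hypothetical placeholders for whatever a sampler would actually produce:

```python
import numpy as np

# Hypothetical posterior draws for a and b; in practice these would come
# from MCMC or some other posterior approximation.
rng = np.random.default_rng(0)
a_samples = rng.normal(loc=0.5, scale=0.8, size=10_000)
b_samples = rng.normal(loc=-1.2, scale=0.4, size=10_000)

x0 = 2.0  # new input point

def p_model(a, b, x):
    """The logistic model p = (1 + exp(a + b*x))^-1 from above."""
    return 1.0 / (1.0 + np.exp(a + b * x))

# Option 1: mean of the predictive posterior of p at x0
p_mean_of_preds = p_model(a_samples, b_samples, x0).mean()

# Option 2: model evaluated at the posterior means of a and b
p_at_mean_params = p_model(a_samples.mean(), b_samples.mean(), x0)

print(p_mean_of_preds, p_at_mean_params)  # typically two different numbers
```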

1 Answer


It depends on the functional form of the model and on whether the parameters are dependent or not. For example, by the linearity of expectation,

$$ E[X+Y] = E[X] + E[Y] $$

Additionally, if they are independent, it holds that

$$ E[XY] = E[X] E[Y] $$

But in general, Jensen's inequality (see examples) tells us that for a convex function $g$,

$$ g(E[X]) \le E[g(X)] $$

So the expected value of a function is not necessarily the same as the function of the expected value. In some simple cases, e.g. a model that is linear in the parameters or a product of independent parameters, the two approaches give the same result, but not in general.
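As a quick numeric illustration (the distributions below are arbitrary, chosen only to show the effect), the linear and independent-product identities hold up under simulation while a nonlinear $g$ does not:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
y = rng.normal(loc=3.0, scale=1.0, size=1_000_000)  # independent of x

print(np.mean(x + y), np.mean(x) + np.mean(y))   # ~4 and ~4: linearity holds
print(np.mean(x * y), np.mean(x) * np.mean(y))   # ~3 and ~3: independence holds

# A convex g, e.g. g(x) = x^2: g(E[X]) <= E[g(X)]
print(np.mean(x) ** 2, np.mean(x ** 2))          # ~1 vs ~5
```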

Using your example, what you want to estimate is

$$ E[y|x] = \int_{a,b} (1 + \exp(a+bx))^{-1} \,f(a,b)\, da \,db $$

which can be approximated with Monte Carlo sampling: draw $R$ samples $(a_i, b_i)$ from the posterior distribution $f$ of the parameters and average,

$$ E[y|x] \approx \frac{1}{R}\, \sum_{i=1}^R (1 + \exp(a_i+b_i x))^{-1} $$

That is, in general, not the same as plugging the posterior means of the parameters into the model:

$$ \Big(1 + \exp\Big(\Big(\frac{1}{R}\, \sum_{i=1}^R a_i\Big)+\Big(\frac{1}{R}\, \sum_{i=1}^R b_i\Big) x\Big)\Big)^{-1} $$
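For instance, with just two equally likely (made-up) posterior draws $a \in \{0, 4\}$ and $b = 0$, the Monte Carlo average is $\tfrac{1}{2}\big((1+e^{0})^{-1} + (1+e^{4})^{-1}\big) \approx 0.26$, whereas plugging in the mean $\bar a = 2$ gives $(1+e^{2})^{-1} \approx 0.12$.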

To convince yourself, try it on actual data: using some model, sample the parameters, compute the predictions, and calculate the result both ways.
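A self-contained sketch of that experiment (simulated data, a hand-rolled random-walk Metropolis sampler, and vague $N(0, 10^2)$ priors; all of these choices are just for illustration):

```python
import numpy as np

rng = np.random.default_rng(42)

# Simulate "actual data" from a known logistic model
a_true, b_true, n = -1.0, 2.0, 200
x = rng.normal(size=n)
y = rng.binomial(1, 1.0 / (1.0 + np.exp(a_true + b_true * x)))

def log_post(a, b):
    """Log-posterior: Bernoulli likelihood plus N(0, 10^2) priors on a and b."""
    p = np.clip(1.0 / (1.0 + np.exp(a + b * x)), 1e-12, 1 - 1e-12)
    loglik = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
    return loglik - 0.5 * (a**2 + b**2) / 100.0

# Random-walk Metropolis to draw posterior samples of (a, b)
R, step = 20_000, 0.2
samples = np.empty((R, 2))
a, b, lp = 0.0, 0.0, log_post(0.0, 0.0)
for i in range(R):
    a_new, b_new = a + step * rng.normal(), b + step * rng.normal()
    lp_new = log_post(a_new, b_new)
    if np.log(rng.uniform()) < lp_new - lp:
        a, b, lp = a_new, b_new, lp_new
    samples[i] = a, b
a_s, b_s = samples[5_000:, 0], samples[5_000:, 1]   # drop burn-in

# Compare the two quantities at a new point x0
x0 = 1.5
mean_of_preds = np.mean(1.0 / (1.0 + np.exp(a_s + b_s * x0)))       # E[y | x0]
pred_at_means = 1.0 / (1.0 + np.exp(a_s.mean() + b_s.mean() * x0))  # plug-in
print(mean_of_preds, pred_at_means)   # generally not equal
```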

Why doesn't looking at the prediction at the expected values of the parameters necessarily make sense? It is easier to explain with another example: say that instead of averaging the parameters you are averaging the data. The "average human" that you would make predictions for would be an Asian male, with a fraction of a wife, fewer than two legs, living somewhere in the middle of the ocean, etc. The same goes for the parameters: the combination of their marginal averages doesn't have to correspond to any meaningful scenario.

Tim
  • Right, so now the question is which to use. I'd think averaging the predictive posterior is "more" correct, but it has the weird side effect that the resulting mean model doesn't necessarily have the same form as any of the parametric models! – Nathaniel Bubis Apr 02 '22 at 16:34
  • @nbubis The second approach is incorrect, for the reason given above. – Tim Apr 02 '22 at 16:38
  • I don't see how this is incorrect - the mean of the posterior value of the model is not the same as the model value with mean parameters, but why is this incorrect? – Nathaniel Bubis Apr 02 '22 at 16:39
  • @nbubis Because we know that the expected value of the function and the function of the expected value are not the same, e.g. by Jensen's inequality. – Tim Apr 02 '22 at 16:53
  • Again, the fact that these are not the same doesn't explain why the first is correct and not the second. If one is interested in the mean value at a certain point, it seems natural to sample the posterior at that point, rather than the model of the posterior parameters. – Nathaniel Bubis Apr 02 '22 at 16:57
  • @nbubis Because it doesn't tell you about the expected value of the model, only about a point estimate at some fixed values that do not necessarily make any sense. – Tim Apr 02 '22 at 17:10
  • The fact that the expected value of a variable may be nonsensical is true, but irrelevant. Take a model with discrete parameters (e.g. number of legs). One option says take the model at the mean parameter values (which are nonsensical); the other says take the mean of the model averaged over the posterior parameter values. – Nathaniel Bubis Apr 02 '22 at 17:37