
Suppose the true data-generating process (DGP) is

$$ x_i \sim d_1(\theta_1), \quad i=1,\ldots,N, $$

where $d_1$ is some probability distribution with parameter(s) $\theta_1$, but I wrongly assume

$$ x_i \sim d_2(\theta_2). $$

Now suppose I fit $d_2$ by (numerical) maximum likelihood estimation. I obviously cannot recover the true parameters consistently, since the distributional assumption is wrong, but suppose my ML estimator converges to some constant

$$ \lim_{N \rightarrow \infty} \hat{\theta}_2 = c. $$

Now my question is: does it hold (for all possible probability distributions) that

$$ p_{d_1}(\theta_1) \geq p_{d_2}(c), $$

where $p$ denotes the respective probability density functions?
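For concreteness, here is a minimal simulation of this setup; the choice of a lognormal $d_1$ and a normal $d_2$ is arbitrary, purely for illustration:

```python
# A minimal simulation of the setup, assuming (purely for illustration)
# that the true DGP d1 is lognormal and the misspecified model d2 is normal.
import numpy as np

rng = np.random.default_rng(0)
theta1 = (0.0, 1.0)  # true lognormal parameters (mu, sigma)

for N in (100, 10_000, 1_000_000):
    x = rng.lognormal(*theta1, size=N)  # samples from the true model d1
    # The normal MLE has a closed form: sample mean and (biased) sample std.
    theta2_hat = (x.mean(), x.std())
    print(N, theta2_hat)

# The estimates stabilize near c = (E[x], Std[x]) of the lognormal,
# i.e. (exp(1/2), sqrt((e - 1) * e)) ~ (1.65, 2.16): the KL-minimizing fit.
```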

stollenm

1 Answer


I am going to slightly rephrase your question: we assume you have $N$ samples $\{x_i\}_{1 \leq i \leq N}$ which were generated from a ground-truth model $d_1$ with parameters $\theta_1 \in \Theta_1$ (where $\Theta_1$ is the set of possible parameters for $d_1$).

You know neither the ground-truth model $d_1$ nor its parameters $\theta_1$. You are going to fit a model $d_2$ (which is different from the ground-truth $d_1$: "all models are wrong") and estimate its parameters $\theta_2 \in \Theta_2$, for instance via maximum likelihood estimation:

$$ \hat{\theta}_2 = \operatorname*{arg\,max}_{\theta_2} \; p(\{x_i\} \mid \theta_2, d_2). $$
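As a concrete sketch of this estimation step (the normal $d_2$ and the scipy-based optimizer are my own illustrative choices, not part of the question):

```python
# A minimal numerical-MLE sketch of the argmax above, assuming (my choice,
# for illustration) that d2 is a normal distribution: minimize the negative
# log-likelihood with scipy.
import numpy as np
from scipy import optimize, stats

def neg_log_lik(theta2, x):
    mu, log_sigma = theta2  # log-sigma keeps the scale parameter positive
    return -stats.norm.logpdf(x, loc=mu, scale=np.exp(log_sigma)).sum()

x = np.random.default_rng(1).lognormal(size=1_000)  # data from some d1
res = optimize.minimize(neg_log_lik, x0=np.zeros(2), args=(x,))
mu_hat, sigma_hat = res.x[0], np.exp(res.x[1])
print(mu_hat, sigma_hat)  # close to the closed-form MLE (x.mean(), x.std())
```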

If you were to know the ground-truth $d_1$ and $\theta_1$, would you necessarily have $p(\{x_i\}|\theta_1,d_1) \geq p(\{x_i\}|\hat{\theta}_2,d_2)$ (i.e. a higher model evidence for the ground-truth model)? Well, no. There is no formal and systematic link between the model evidences of $d_1$ and $d_2$, since their ratio depends on:

  1. Their relative complexities (i.e. the number of free parameters, as measured by $|\Theta_1|$ and $|\Theta_2|$). If $|\Theta_2| < |\Theta_1|$, i.e. if $d_2$ is simpler than $d_1$, it might not be able to explain the observations, and hence have a low likelihood. Conversely, if $|\Theta_1| < |\Theta_2|$, then the evidence for model $d_2$ will be penalized by its larger number of free parameters. This is nicely explained in chapter 28 of the following textbook:

MacKay, D. J. C. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

  2. The observations $\{x_i\}_{1 \leq i \leq N}$. If $N$ is small, or if the sample $\{x_i\}$ is atypical and does not represent the average output of $d_1$, then the model evidence for $d_1$ will be small (a numerical sketch of this effect follows after the reference below). This is a case of non-identifiability, which we discuss in the following paper:

Gontier, C., & Pfister, J. P. (2020). Identifiability of a binomial synapse. Frontiers in Computational Neuroscience, 14, 558477.
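To illustrate this second point numerically, here is a minimal sketch; the Laplace/normal pair and the sample sizes are arbitrary illustrative choices:

```python
# A numerical sketch of point 2 above; the Laplace/normal pair is my own
# arbitrary choice. True model: d1 = Laplace(0, 1). Misspecified model:
# d2 = normal with MLE-fitted parameters. For small N, the *fitted* wrong
# model often attains a higher likelihood on the realized sample than the
# true model evaluated at its true parameters.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def frac_d2_wins(N, reps=2_000):
    wins = 0
    for _ in range(reps):
        x = rng.laplace(0.0, 1.0, size=N)
        ll_true = stats.laplace.logpdf(x, loc=0.0, scale=1.0).sum()
        ll_fit = stats.norm.logpdf(x, loc=x.mean(), scale=x.std()).sum()
        wins += ll_fit > ll_true
    return wins / reps

for N in (5, 20, 100, 1_000):
    print(N, frac_d2_wins(N))

# The fraction where the misspecified fit wins is large for small N and
# shrinks as N grows: the inequality asked about can fail in finite samples.
```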

I also proposed a solution for the case where $|\Theta_1| < |\Theta_2|$ in the following question: Formal proof of Occam's razor for nested models

Camille Gontier
  • Thank you very much for your answer. It gets me a lot closer to understanding my situation. Your second point is exactly what I was trying to avoid by thinking about the situation in the limit. The first one is less applicable to my question: I should have specified that I assume both models have the same number of parameters. – stollenm Oct 02 '22 at 06:24