22

Let $X$ be an $\mathcal{X}$-valued random variable. Suppose that we have observed $X = x$. We use a parametric model with parameter $\theta \in \Theta$.

In the frequentist approach, we believe that there exists a true $\theta \in \Theta$ which determines the distribution of $X$. The inference procedure is to use $x$ to estimate this true $\theta$.

In the Bayesian approach, we impose a prior probability measure $\Pi$ on $\Theta$. The practical inference procedure is to choose a statistical model (a collection of probability measures $\mathcal{P}$ in which $P_{X \mid \theta}$ takes values) along with a prior $\Pi$ reflecting the beliefs of the client, update $\Pi$ using $x$, and draw conclusions from the posterior distribution. There is literally no other way of doing the job.

My question is: when we adopt the Bayesian framework, after choosing the statistical model $\mathcal{P}$, do we believe in the existence of a true distribution of $\theta$, $P_{\theta}$ (so that we view the posterior as an estimate of this true $P_{\theta}$)? I think that, even if we do believe in it, it probably will not affect how we carry out our work in practice. But it seems that, once we believe in it, the Bayesian approach completely includes the frequentist approach (let $P_{\theta}$ concentrate on a singleton).

温泽海
  • 369
  • 4
    If the prior distribution represents your belief, then a prior concentrated at a single point would suggest that no evidence would persuade you to change your belief. This is not equivalent to the frequentist approach, which takes the true value as being unknown. It also conflicts with what Lindley called Cromwell's rule – Henry Mar 24 '24 at 23:18
  • 2
    The use of "we" in this question looks strange to me - how can I know what you believe? It is even possible to do a frequentist analysis without "believing" that there is a true model or a true parameter, see https://arxiv.org/abs/2007.05748 – Christian Hennig Mar 25 '24 at 07:20
  • 1
    More an opinion than an answer - the virtue of Bayesian statistics is that we disavow declaring anything to be "truth" per se. Truth could be conveniently defined as the limit as $n \rightarrow \infty$ in a Bayesian experiment, so that a Bayesian WLLN applies, but as many opponents of frequentism like to point out, this is a thought experiment and not pragmatic. Therefore we quantify beliefs with probability models on the assumption that they are convenient and mutually acceptable for scientific dialogue. – AdamO Mar 25 '24 at 18:41

4 Answers

28

Do not confuse a priori knowledge of the true parameter value with having a true prior over the parameter space

The answer to this question depends on the underlying philosophical interpretation of probability that one takes within the Bayesian paradigm. The most popular (and in my view coherent) philosophical interpretation is the "epistemological interpretation" of probability.$^\dagger$ This philosophical approach takes "probability" to be a tool used to quantify our uncertainty about unknown things, subject to a set of coherence requirements (see Bernardo and Smith (1994), Ch 2-3 for discussion).

Broadly speaking, a probability measure arises as an effective tool for the quantification of uncertainty if we wish to measure uncertainty using real numbers (i.e., using a continuum as a measure of certainty) and we wish to avoid certain kinds of "incoherence". This approach was famously put forward in Ramsey (1931) and has been discussed at length by subsequent authors (see e.g., Kyburg 1978, Kennedy and Chihara 1979, Skyrms 1987, Christensen 1991, Skyrms 1992, Zynda 1996, Welch 2014, Roche and Schippers 2014). These arguments typically require the user to quantify their uncertainty in a way that avoids "Dutch book" outcomes over possible bets on states, which is argued to be a type of incoherence of belief. Setting aside the particulars of the coherence argument, what is most important here is that probability is viewed as a tool created by humans to analyse the world, not an inherent part of the world itself. In particular, the epistemic interpretation does not assume that there is any metaphysical analogue to probability in nature (e.g., aleatory probabilities of events), and so it is compatible with both deterministic and non-deterministic views of nature.

If you adopt the epistemic interpretation of probability as the foundation for Bayesian statistics, then a "true" prior is simply one that properly represents your epistemic uncertainty about the quantity at issue given the information you start with (i.e., prior to seeing the data that is the subject of the analysis). The prior is true if it properly represents your beliefs (subject to the stipulated coherence requirements) prior to seeing the data, and false if it doesn't. Likewise, if you genuinely believe that the likelihood function in your analysis captures the nature of the observable data, then the posterior formed by updating your true prior belief will be the true posterior. The posterior is true if it properly represents your beliefs (subject to the stipulated coherence requirements) after seeing the data, and false if it doesn't. Within the epistemological paradigm, one must decide whether the true prior is elicited from genuine subjective beliefs (e.g., via a priori views of break-even betting odds) or by some objective approach (e.g., reference priors); consideration of the relevant literature and methods should determine how you view a priori uncertainty, and hence what you take to constitute your "true" prior.

Now, even though the "true" prior/posterior may be determined by the epistemic approach, in most circumstances a Bayesian will still believe that there is some true long-run behaviour that establishes the true parameter value in a model (see this related question). This belief typically arises from an underlying belief in an infinite sequence of potential observations from the experiment under study, and the true parameters generally correspond to various limiting functions of the stipulated sequence. Thus, there will typically be a true (but unknown) parameter, a true prior, and a true posterior, and all of these will be different.

You get into trouble when you say that the Bayesian approach completely includes the frequentist approach simply because you could take the prior distribution to be a point-mass distribution on a known point. Firstly, that would still not make Bayesian analysis equivalent to frequentist analysis (there are several other differences between these approaches); but more importantly, is that actually your true prior belief? If you are certain a priori that the parameter equals the stipulated value, then that is indeed your true prior (and you might get into this situation), but if you are not certain of that parameter value then such a point-mass is not a true prior in the epistemological sense.


$^\dagger$ This is often called the "subjective interpretation" which is a bad name for it, since it doesn't necessitate subjectivity of the prior.

Ben
  • 124,856
14

I'll comment on both the idea of a "true parameter/parametric model" and a "true prior" as the question is somewhat ambiguous about which one of these is of interest here.

First, regarding frequentism: it is true that analysing data based on parametric frequentist modelling assumes that there is a true model and a true parameter value, which is what frequentist inference makes statements about.

This does not mean, however, that anybody who uses such methods needs to believe that these models or any parameter value are really true in reality. We use methods that are justified by and derived from artificial formal models, which are always an idealisation of reality and should therefore not be called "true" in reality. The fiction of a true parameter allows us to analyse the characteristics of our inference mathematically; this is pretty much the best justification for such methods that we can get, which is why we use such models, but it doesn't imply "belief". I think that any proper interpretation and discussion of results of (not only) frequentist inference needs to acknowledge that the models are justified within the "mathematical world", which is different from the real world.

Much of what I write here (above as well as below) has been elaborated in Hennig, C. (2023). Probability Models in Statistical Data Analysis: Uses, Interpretations, Frequentism-as-Model. In: Sriraman, B. (eds) Handbook of the History and Philosophy of Mathematical Practice. Springer, Cham. https://doi.org/10.1007/978-3-030-19071-2_105-1; free on https://arxiv.org/abs/2007.05748

Regarding the Bayesian approach, @Ben has given a good answer. Note, though, that there is more than one interpretation of Bayesian probabilities. De Finetti, for example, is very explicit about not believing in true models and parameters. According to him, the parametric model is only a device to derive meaningful posterior predictive distributions. In de Finetti's sense one can interpret the posterior as regarding expected future observations, but not as regarding a true parameter value, as this doesn't exist. A "true prior" in this sense would be a prior that correctly expresses your personal uncertainty (or, in "objective Bayes", the uncertainty based on secured "objective" knowledge).

It has been argued, though (e.g., in D. Gillies' "Philosophical Theories of Probability"; similar arguments are made in Diaconis & Skyrms's "Ten Great Ideas about Chance"), that having your belief modelled by a prior based on an exchangeability assumption and a parametric sampling model implies the belief that, if infinitely many observations could be collected, they would in fact behave like the sampling model with a certain true parameter value. In this sense one could correctly state that if your belief is modelled in this standard Bayesian way, you implicitly also believe in a true parameter in the sense defined above. Diaconis & Skyrms (and some other Bayesians) indeed hold that in this way the Bayesian approach actually includes frequentism, but as @Ben correctly notes, there are other differences between these schools. In particular, frequentist inference is about performance characteristics of methods given the true parameter, whereas Bayesian inference is about making probability statements about that parameter and future observations (in my paper cited above I call the former "compatibility logic" - frequentist inference is about whether models are compatible with the data, not about whether they are true - and the latter "inverse probability logic").
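As a minimal sketch of the exchangeability point (my own illustration, assuming a standard Beta-Bernoulli conjugate model, not part of the original argument): under an exchangeable model the posterior depends only on the counts, not on the order in which the observations arrived.

```python
import numpy as np

a, b = 2.0, 2.0  # Beta(2, 2) prior on the Bernoulli parameter

def posterior_params(seq, a=a, b=b):
    """Conjugate update: Beta(a + #successes, b + #failures)."""
    seq = np.asarray(seq)
    return a + seq.sum(), b + len(seq) - seq.sum()

# Two orderings of the same observations give the same posterior:
print(posterior_params([1, 1, 0, 0, 1]))
print(posterior_params([0, 1, 1, 1, 0]))
```

Because the posterior is a function of the counts alone, no data can ever teach this model that the order matters, which is exactly the point made below about exchangeability being a strong commitment rather than a mere "belief".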

Furthermore, I think that Bayesian epistemic/subjective probability is just as much an idealisation as frequentist probability. In particular, nobody would normally really believe in exchangeability: it implies not only that the order of observations is irrelevant, but also that one can never learn from the data, taking the order of observations into account, that the order is in fact relevant, contrary to what was initially assumed (meaning that "believing" in the irrelevance of the order is not enough; you have to be 100% sure about it). So the above argument doesn't really hold, as exchangeability is assumed for convenience and for having a well-defined way of learning from the past for the future, not because anybody believes it is actually true. Also, a "true prior", if it even exists, may not agree with the one used for statistical analysis (for example by not assuming exchangeability).

Another aspect is that the probabilities used in Bayesian inference can also be understood in an empirical, frequentist way. In this case the sampling model is interpreted as frequentist (as said above, this does not necessarily mean that we have to believe in it; we merely analyse the situation as if it were true), and the prior can refer to a frequentist distribution over true parameters in similar studies. This is advocated in several places by Andrew Gelman; see also my paper above. A key problem here is that, in order to define what the "true prior" would be, one needs a precise definition of the "reference set" of studies that qualify to be included in the population on which the prior is based. This is hardly ever given and probably very hard to specify.

A final aspect is that in some situations it can be argued that, although the parametric model is an idealisation and not literally "true", the parameter refers to something that really exists (like the quantity of a certain pollutant in a river, measured with uncertainty). In this way one could justify the existence of a "true parameter" without holding the model to be "true" (and a prior to formalise uncertainty about that true parameter), although this of course requires connecting the "true" parameter with the model within which it is mathematically defined, which may be "philosophically hard" without assuming the model to be true as well.

4

do we believe in the existence of a true distribution of $\theta$, $P_{\theta}$ (such that we view posterior as an estimation of this true $P_{\theta}$)

  1. A posterior distribution does not aim to estimate a 'true prior' but instead a 'true parameter'.

    For example, the fat percentage of a person is a specific fixed value (if we neglect small temporal weight variations like respiration, sweating etc.), but it follows a certain distribution in a population.

    • When we 'randomly select' a person from a population, then we may consider the fat percentage of that person as like being drawn from a probability distribution that is the distribution of fat percentages in the population.
    • But once selected the value is considered fixed.

    After we make some measurements like skin fold thickness, or electric conductance of the body, then we try to estimate the fat percentage of the person and not the fat percentage distribution of the population that the person came from.

  2. The idea that the parameter we estimate comes from some distribution, as with the fat-percentage distribution in the population, may apply, but this is not always the case. We may for instance be estimating a physical constant instead (although technically this can still be considered a distribution, namely a degenerate one).

    The prior is not the same as this distribution. Or rather: this distribution is not to be called a prior.

    A prior can also be used, for example, to improve estimates, as in LASSO or ridge regression. That is a different 'concept' from an estimate or guess of the true population distribution (which in the case of constants doesn't even exist).
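To illustrate the last point - that a prior can serve as a regularisation device rather than as an estimate of a population distribution - here is a minimal sketch (my own, with made-up data) showing that ridge regression is the MAP estimate under a Gaussian prior on the coefficients:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = np.array([2.0, -1.0, 0.0, 0.5, 0.0])
sigma = 2.0  # noise standard deviation (assumed known here)
y = X @ beta_true + rng.normal(scale=sigma, size=n)

lam = 3.0  # ridge penalty
# Penalised least squares: minimise ||y - X b||^2 + lam * ||b||^2
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# MAP estimate under b ~ N(0, (sigma^2 / lam) I) and y | b ~ N(X b, sigma^2 I)
tau2 = sigma**2 / lam
beta_map = np.linalg.solve(X.T @ X / sigma**2 + np.eye(p) / tau2,
                           X.T @ y / sigma**2)

# The two coincide: the "prior" here acts as a shrinkage device, not as a
# claim that the coefficients were actually drawn from this distribution.
assert np.allclose(beta_ridge, beta_map)
```

Nothing in this construction requires believing that the coefficients were sampled from the Gaussian prior; the prior is chosen for its estimation properties.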

  • One minor comment: The lasso and other methods using shrinkage priors, when used with high dimensional candidate predictors, can be thought of as assuming a population distribution of effects of those predictors. So I think you can envision such a prior in the high-dimensional case as having both an individual effect as well as population of effect interpretations. – Frank Harrell Mar 25 '24 at 11:39
  • 1
    @FrankHarrell the point of my example with lasso and ridge was that the usage of a prior has more functions than just mimicking some true population distribution. Using the name of 'true prior' for some true underlying population distribution in a data generating process, might be something like a misnomer. (something that I still intend to work out in Can the "true" prior lead to better posterior estimation?) – Sextus Empiricus Mar 25 '24 at 12:35
3

The question is interesting, albeit somewhat ill-posed. Bayesians are generally comfortable with the idea of some point $\theta_0$ in parameter space $\Theta$ being the true parameter of a given parametric model $p_{X|\theta}$.

Your prior probability $\pi(\theta)$ over $\Theta$ then describes your confidence regarding the location of $\theta_0$, and with every new piece of information $x_{1}, x_{2}, \ldots, x_{T}$ that you update your prior with, $\pi(\theta \mid x_{1}, x_{2}, \ldots, x_{T})$ becomes narrower and narrower until it concentrates over $\theta_{0}$.

But the concentration of $\pi(\theta|\cdot)$ stands at the end of this process, not the beginning. In other words, we should be talking about the posterior concentrating over the "truth", not the prior (for a seminal paper discussing posterior concentration, see Ghosal et al.).
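As a numerical sketch of posterior concentration (my own illustration, using a conjugate Beta-Bernoulli model rather than the general setting of Ghosal et al.), the posterior standard deviation shrinks as $T$ grows while the posterior mean approaches $\theta_0$:

```python
import numpy as np

rng = np.random.default_rng(1)
theta0 = 0.3     # "true" parameter generating the data
a, b = 1.0, 1.0  # Beta(1, 1) prior: uniform over [0, 1]

for T in (10, 100, 10_000):
    x = rng.binomial(1, theta0, size=T)
    a_post, b_post = a + x.sum(), b + T - x.sum()
    mean = a_post / (a_post + b_post)
    sd = np.sqrt(a_post * b_post
                 / ((a_post + b_post) ** 2 * (a_post + b_post + 1)))
    print(T, round(mean, 3), round(sd, 4))  # sd shrinks roughly like 1/sqrt(T)
```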

In principle, you could have a point prior $\pi(\theta) = \delta(\theta-\theta_0)$ (where $\delta$ is the Dirac delta). But in that case any further Bayesian updating is pointless: you already know the "true" $\theta$ with absolute certainty, and every $\theta \neq \theta_{0}$ has zero probability mass, which no amount of data will ever "undo".
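A small numerical sketch of this (my own illustration, on a discrete grid rather than a continuous parameter space): zero prior mass can never be revived by Bayes' rule, no matter how strongly the data favour another value.

```python
import numpy as np

thetas = np.array([0.2, 0.5, 0.8])
prior = np.array([0.0, 0.0, 1.0])  # point-mass prior on theta = 0.8

# Data that strongly favour theta = 0.2: 10 successes in 100 trials
k, n = 10, 100
lik = thetas**k * (1 - thetas)**(n - k)

post = prior * lik
post /= post.sum()
print(post)  # all mass stays on theta = 0.8
```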

Durden
  • 1,171
  • I have edited the problem to clarify what I actually mean. Sorry for the confusion. – 温泽海 Mar 24 '24 at 23:52
  • Thank you, that made the question a bit clearer. As for the "true prior", I second @Ben's answer: the prior that contains all your current knowledge about $\theta$ is the "true" one. – Durden Mar 25 '24 at 14:37