19

As someone who started out studying classical statistics, where the Central Limit Theorem is key to making inferences, and only later moved on to Bayesian statistics, I was late to realize that the Central Limit Theorem has a much smaller role to play in Bayesian statistics. Does the Central Limit Theorem play any role at all in Bayesian inference/statistics?

Later addition: From Bayesian Data Analysis by Gelman et al., 3rd edition, on the central limit theorem in the Bayesian context: "This result is often used to justify approximating the posterior distribution with a normal distribution" (page 35). I went through a graduate course in Bayesian statistics without encountering an example in which the posterior was approximated with a normal distribution. Under what circumstances is it useful to approximate the posterior with a normal distribution?

  • 3
    Sextus' answer to this question seems relevant: https://stats.stackexchange.com/questions/570503/how-would-a-bayesian-estimate-a-mean-from-a-large-sample – Manuel Jan 10 '23 at 13:18
  • 5
    there are Bayesian versions of central limit theorems, but they play a fundamentally different role because Bayesians (in broad terms) don't need asymptotics to produce inference quantities; rather, they use simulation to get "exact" (i.e. up to numerical error) posterior quantities. There's no need to lean on asymptotics to justify a credible interval, as one would to justify a confidence interval based on the hessian of the likelihood, say. – John Madden Jan 10 '23 at 13:52
  • 1
    @JohnMadden, that's a legit point. In fact, a coherent answer can be made out of it, imo. – User1865345 Jan 10 '23 at 14:14
  • @User1865345 no promises on its coherency but you have bullied me into an answer ;) – John Madden Jan 10 '23 at 16:39
  • The central limit theorem is central to any field of statistics. It describes the tendency of sums of variables to approach a normal distribution and that is independent from how you would wish to analyse the variables, whether it is frequentist or Bayesian. – Sextus Empiricus Jan 10 '23 at 17:14
  • @SextusEmpiricus: I would love to see that perspective flushed out in an answer. I asked the question because I didn't think there is enough to argue that the Central Limit Theorem plays much of a role in Bayesian Statistics. Sure, when MCMC is too computationally expensive, we can approximate the posterior to simplify our life, but I would speculate that this is rarely done. Bayesian Statisticians have plenty of computational tricks to solve any computational bottleneck. I would love to be wrong about this. This is why I asked the question. – ColorStatistics Jan 10 '23 at 17:22
  • 1
    @ColorStatistics I am still pondering a bit over this question. For instance, I am not a fan of dividing statisticians into Bayesian and frequentist statisticians, as if this is a property of the statistician rather than the technique. And regarding the central limit theorem, I believe that this is such a general principle that the question is like asking "what is the role of x in Bayesian statistics", where we can fill in for x something basic and general like 'integration', 'binomial distribution', 'cloud computation', etc. – Sextus Empiricus Jan 10 '23 at 17:50
  • @SextusEmpiricus: feel free to answer the question paraphrased as “Why the Central Limit Theorem plays an important role in Bayesian inference?” My take is that the answer to this question would be necessarily very short. – ColorStatistics Jan 10 '23 at 18:03
  • Looks like you are the only one of that opinion, but sure go ahead and close it – ColorStatistics Jan 10 '23 at 18:07
  • 2
    Check the keyword Bernstein-von Mises as the Bayesian version of the CLT. There is a large and current literature on the topic. – Xi'an Jan 11 '23 at 10:14

3 Answers

17

The Frequentist needs asymptotics because the things they are interested in, like intervals which cover the true value 95% of the time or tests which have a false positive rate of less than 5% when the null hypothesis is true, typically do not exist. If the model is linear and the errors Gaussian, we can get exact confidence intervals, but rarely otherwise. However, we can build intervals which cover the truth asymptotically in very broad classes of models by exploiting a quadratic approximation of the likelihood.
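
For concreteness, here is a minimal sketch of such an asymptotic (Wald) interval in Python; the Bernoulli model and the numbers are purely illustrative:

```python
import numpy as np
from scipy import stats

# Illustrative data: n Bernoulli trials (model and numbers are made up)
rng = np.random.default_rng(0)
n = 200
x = rng.binomial(1, 0.3, size=n)

# MLE of the success probability
p_hat = x.mean()

# Observed Fisher information of the Bernoulli likelihood at the MLE:
# -d^2/dp^2 log L(p) evaluated at p_hat equals n / (p_hat * (1 - p_hat))
info = n / (p_hat * (1 - p_hat))

# Asymptotic 95% interval from the quadratic approximation of the
# log-likelihood around its maximum; coverage is only guaranteed as n grows
se = 1 / np.sqrt(info)
z = stats.norm.ppf(0.975)
print(f"Wald 95% CI: [{p_hat - z * se:.3f}, {p_hat + z * se:.3f}]")
```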

The Bayesian does not have this problem. Given a prior and a likelihood, the posterior is well defined, and the 95% credible interval is then a precisely defined concept: any interval which contains 95% of the posterior mass. Likewise, Bayes factors can be defined in terms of posterior quantities. Life is easier in the linear/Gaussian case because these quantities will be available in closed form. But even in the general case, we can precisely define these quantities mathematically, and thus use the tools of numerical analysis to compute approximations. Most prominent among these is Markov chain Monte Carlo. The Bayesian, given infinite computing power, can thus get arbitrarily close to "correct" credible intervals/posterior means/etc. for any sample size and any model.
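
As a small illustration, here is a conjugate Beta-Bernoulli example in Python (the prior and data are made up): the credible interval is available in closed form, and plain simulation recovers it up to Monte Carlo error, with no appeal to asymptotics in the sample size:

```python
import numpy as np
from scipy import stats

# Conjugate Beta-Bernoulli model; the Beta(1, 1) prior and the
# data (63 successes in 100 trials) are purely illustrative
successes, trials = 63, 100
posterior = stats.beta(1 + successes, 1 + trials - successes)

# Closed-form equal-tailed 95% credible interval
exact = posterior.ppf([0.025, 0.975])

# The same interval by simulation: "exact" up to numerical error,
# for any sample size, without leaning on a CLT in the data
draws = posterior.rvs(size=1_000_000, random_state=0)
simulated = np.quantile(draws, [0.025, 0.975])

print("closed form:", np.round(exact, 4))
print("simulated:  ", np.round(simulated, 4))
```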

[Of course, if the prior is not good, these quantities are utterly meaningless. Even if it is, they do not have any guarantee of relating to anything in the "real world" like a frequentist interval does; they are simply the results of "thinking rationally".]

You also ask how Bayesians might avail themselves of the CLT. This comes in handy when the Bayesian doesn't have infinite computing power. MCMC is guaranteed to work eventually, but it might take too long on your computer. If the reason the posterior is expensive to evaluate is that you have a lot of data, we can deploy a normal approximation to the posterior. Various ways exist to choose the parameters of the approximating normal; perhaps the most popular is the Laplace approximation, which uses a quadratic approximation of the posterior near its mode (this might remind you of frequentist asymptotics).
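
For concreteness, a minimal sketch of a Laplace approximation in Python; the one-parameter model (Bernoulli likelihood on the log-odds with a standard normal prior) and the data are made up for illustration:

```python
import numpy as np
from scipy import optimize, stats

# Hypothetical posterior for one parameter theta (the log-odds):
# Bernoulli likelihood with a standard normal prior; data are made up
x = np.array([1, 0, 1, 1, 0, 1, 1, 1, 0, 1])

def neg_log_post(theta):
    p = 1 / (1 + np.exp(-theta))  # logistic link
    log_lik = np.sum(x * np.log(p) + (1 - x) * np.log(1 - p))
    log_prior = stats.norm.logpdf(theta, 0, 1)
    return -(log_lik + log_prior)

# Laplace approximation: locate the posterior mode, then use the
# curvature of the log-posterior there as the inverse variance
res = optimize.minimize_scalar(neg_log_post, bounds=(-5, 5), method="bounded")
mode = res.x

eps = 1e-4  # finite-difference second derivative at the mode
curv = (neg_log_post(mode + eps) - 2 * neg_log_post(mode)
        + neg_log_post(mode - eps)) / eps**2

approx_posterior = stats.norm(loc=mode, scale=1 / np.sqrt(curv))
print("approximate 95% credible interval for the log-odds:",
      approx_posterior.ppf([0.025, 0.975]))
```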

John Madden
  • 2
    +1: I was not aware of the Laplace Approximation; thank you – ColorStatistics Jan 10 '23 at 17:12
  • 3
    +1. Coherent indeed. – User1865345 Jan 10 '23 at 17:13
  • Bayesian approaches also rely on estimates of densities. It's just that with Bayesian approaches one often already uses estimates for sampling, like the MCMC that you mention, such that a normal approximation is not necessary. – Sextus Empiricus Jan 10 '23 at 17:19
  • @SextusEmpiricus "Bayesian approaches rely as well on estimates of densities" can you expand on what you mean by this? – John Madden Jan 10 '23 at 17:22
  • 1
    When there is no closed form expression of the posterior density and an approximation is used (e.g. sampling). – Sextus Empiricus Jan 10 '23 at 17:23
  • 1
    @JohnMadden: Overall though, I think your answer supports the argument that the CLT is hardly critical to Bayesian Statistics; it seems to support the claim that you could teach Bayesian Statistics completely ignoring the CLT and focus on computational solutions and the students would not be deprived of much. Is this your position? – ColorStatistics Jan 10 '23 at 17:27
  • 2
    @SextusEmpiricus unfortunately I'm struggling to pin down which part of my answer is in conflict with your comments. – John Madden Jan 10 '23 at 17:31
  • 2
    @ColorStatistics I agree; and FWIW, my grad Bayes class didn't mention Bayesian CLTs either :) – John Madden Jan 10 '23 at 17:32
  • @JohnMadden: Thank you for that clarification; that was my experience as well; In the quote I used in the post, Andrew Gelman uses the word "often" to describe how frequently the CLT is used to approximate the posterior; I am surprised by the "often" qualifier – ColorStatistics Jan 10 '23 at 17:38
  • @ColorStatistics perhaps this "often" is "conditional on a normal approximation being used, the justification is often a Bayesian CLT" rather than simply "the Bayesian CLT is often used". – John Madden Jan 10 '23 at 17:57
  • @JohnMadden the conflict is with your first paragraph where you state in a way that frequentist methods need the central limit theorem or approximations with a normal distribution. The use of the CLT isn't any different between Bayesian and frequentist methods. It is just a way to approximate distributions of sums of variables. With a Bayesian method, anytime that a normal distribution is used (for instance as prior), then in a way this is the use of the CLT. – Sextus Empiricus Jan 10 '23 at 17:58
  • 2
    Erm, MCMC is not Bayesian per se, i.e., the CLT used in MCMC is not for inference purposes but for Monte Carlo convergence control. – Xi'an Jan 11 '23 at 10:16
  • @SextusEmpiricus fair enough; CLT not necessary for freq inference. But more necessary than in Bayes? ;) – John Madden Jan 11 '23 at 13:49
  • 1
    @Xi'an re: "Not Bayesian per se": I tried to make this clear by my phrasing being "...use the tools of numerical analysis [such as] MCMC". (the implication being of course the tools of NA are not inherently Bayesian, and hence that MCMC isn't). Please let me know if you have better phrasing in mind. – John Madden Jan 11 '23 at 13:50
  • "CLT not necessary for freq inference. But more necessary than in Bayes?" CLT is a very general principle that is not specifically related to any form of inference. It is like asking whether compound distributions are more or less "necessary" for Bayesian inference than for frequentist inference. The term 'necessary' makes this a difficult question. Depending on the problem, it may not be necessary at all, or even not remotely applicable. – Sextus Empiricus Jan 11 '23 at 14:07
  • "they do not have any guarantee of relating to anything in the "real world" like a frequentist interval does" - as frequentist models are only idealisations, the frequentist interval doesn't have such guarantees either (only if the model holds, which in reality it doesn't). – Christian Hennig Jan 11 '23 at 14:44
  • @SextusEmpiricus thanks for engaging in this conversation with me over the last two days, I appreciate it and I've learned from it. I hope you don't take it the wrong way if I suggest we both move on. I'm thinking differently about things as a result of our interaction :) – John Madden Jan 11 '23 at 15:23
  • 1
    @ChristianHennig this is a good point (and why I have quotes around "real world"), and it calls to mind a George Box quote that I've heard about 10 million too many times, but I don't think this should obfuscate the fact that there is still a distinction in the relationship between (perhaps we should call it) "idealized reality" and frequentist vs Bayesian quantities. – John Madden Jan 11 '23 at 15:26
  • 1
    @JohnMadden I agree... at least the frequentists try to model reality. If you have too much time on your hands ;-), you may like to read this: https://arxiv.org/abs/2007.05748 – Christian Hennig Jan 11 '23 at 15:37
  • @ChristianHennig I absolutely shall (after the ICML deadline is passed ;) – John Madden Jan 12 '23 at 01:55
6

The limit theorem is 'central'

The central limit theorem (CLT) has a central role in all of statistics. That is why it is called central! It is not specific to frequentist or Bayesian statistics.

Note the early (and possibly first ever) use of the term 'central limit theorem' by George Pólya, who used it in the 1920 article "Über den zentralen Grenzwertsatz der Wahrscheinlichkeitsrechnung und das Momentenproblem":

The occurrence of the Gaussian probability density $e^{-x^2}$ in repeated trials, in measurement errors that result from the combination of very many and very small elementary errors, in diffusion processes, and so on, can, as is well known, be explained by one and the same limit theorem, which plays a *central* role in probability theory.

emphasis is mine.

The principle behind the limit is applied whenever we use a normal distribution

The CLT describes the tendency of sums of variables to approach a normal distribution, and that is independent of how you wish to analyse the variables, whether frequentist or Bayesian. Such sums occur everywhere.

It is arguable that whenever a normal distribution is used, it is indirectly an application of the central limit theorem. A normal distribution does not occur as an atomic distribution. Nothing is inherently normally distributed; when a normal distribution 'occurs', it is always due to some process that sums several smaller variables (e.g. a Galton board, where a ball hits multiple pins before ending up in a bin), and such sums can be approximated by a normal distribution.
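
A quick simulation makes the Galton-board picture concrete (a sketch; the number of pins and balls is arbitrary):

```python
import numpy as np

# A Galton board as a sum of many small binary deflections: each ball
# is pushed left (-1) or right (+1) at each of 50 pins
rng = np.random.default_rng(0)
pins, balls = 50, 100_000
positions = rng.choice([-1, 1], size=(balls, pins)).sum(axis=1)

# Each deflection has mean 0 and variance 1, so the CLT says the
# standardized final position is approximately standard normal
z = positions / np.sqrt(pins)
print("sample mean:", z.mean().round(3), " sample sd:", z.std().round(3))
# Compare with ~0.683 for a true standard normal
print("fraction with |Z| < 1:", np.mean(np.abs(z) < 1).round(3))
```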

The use of the normal distribution can have other motivations. For instance, it is the maximum entropy distribution for a given mean and variance. But in that case it still indirectly relates to the CLT, as we can see a maximum entropy distribution as arising from many random operations that preserve some parameters (in the case of the normal distribution, the mean and variance). When we add up many variables with a given mean and variance, the resulting distribution is likely to be something with high entropy, i.e. something close to the normal distribution.

The CLT is such a general principle that the question is like asking "what is the role of 'integration' in Bayesian statistics", where any other basic and general tool could be filled in for 'integration'.

Practical application of CLT

In practice one may observe a tendency for textbooks, statisticians, or fields to apply a particular technique, frequentist or Bayesian, and to use a normal approximation relatively more or less often. But in principle that is not tied to those fields.

In practice a particular technique might be preferred. For instance, when approximating intervals, one can use a normal distribution as an approximation, but that is not a necessity. One can also use a Monte Carlo simulation to estimate the distribution, or sometimes there is a formula for the exact distribution.
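
A small sketch of that contrast in Python, with skewed made-up data so the two intervals differ visibly: a CLT-based normal approximation next to a percentile bootstrap, which builds the interval by simulation rather than by an explicit normal approximation:

```python
import numpy as np
from scipy import stats

# Skewed illustrative data, where a normal approximation is rough
rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=40)

# CLT-based 95% interval for the mean (normal approximation)
se = x.std(ddof=1) / np.sqrt(len(x))
z = stats.norm.ppf(0.975)
print("normal approximation:", (x.mean() - z * se, x.mean() + z * se))

# Monte Carlo alternative: percentile bootstrap, resampling the data
# instead of invoking an explicit normal approximation
boot_means = np.array([rng.choice(x, size=len(x), replace=True).mean()
                       for _ in range(10_000)])
print("percentile bootstrap:", tuple(np.quantile(boot_means, [0.025, 0.975])))
```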

Possibly Bayesian approaches use the normal approximation less often because they are in a situation where they use Monte Carlo simulation/sampling already anyway (to find a solution for large intractable models).

It can be that in particular fields the models are too complex for a normal approximation, and that those fields also often apply Bayesian techniques. That doesn't make the role of the CLT smaller for Bayesian techniques, at least not in principle.

A large number of scientists use little more than simple tools like ANOVA, chi-squared tests, ordinary least squares fits, or small variations of them. Those techniques happen to be frequentist and to use a normal approximation. Because of that it might seem like frequentist techniques often use the CLT, but frequentist inference does not rely on it in principle.


Related:

How would a Bayesian estimate a mean from a large sample?

Would you say this is a trade off between frequentist and Bayesian stats?

  • The notion that any measure (such as a prior, or a random effect) that "averages a bunch of things up" strongly tends to normal is good intuition for a statistician to have in their back pocket. – AdamO Jan 10 '23 at 18:21
  • 1
    Thank you for this perspective. If I can try to boil down your answer, you're saying that any time we use a Normal prior in Bayesian Inference we're standing on the shoulders of the CLT. Makes sense. On one hand, much of classical/frequentist inference would not exist were it not for the CLT; the CLT is directly making the inference possible. On the other hand, in Bayesian Inference, only in the particular case that we assume the prior to be Normal, does the CLT come into play; and it does so very indirectly, reminding us that the Normal distribution phenomenon is inherently linked to the CLT. – ColorStatistics Jan 10 '23 at 18:35
  • @ColorStatistics "much of classical/frequentist inference would not exist were it not for the CLT" I disagree with this claim. The approximations with a normal distribution make life easier, but are not essential or necessary. That's what I mentioned in my last sentences: "For instance when approximating intervals, then one can use a normal distribution as approximation, but one can also use a Monte Carlo simulation to estimate the distribution". Many frequentist methods have an approach with a normal distribution approximation, and at the same time an 'exact method' as well. – Sextus Empiricus Jan 10 '23 at 19:03
  • Thanks for this answer. Do you have any references you can recommend that deploy MC-based interval generation, or discuss this procedure theoretically? As alluded to in your comment. – John Madden Jan 10 '23 at 19:13
  • @JohnMadden how about Wikipedia's article on the bootstrap method? – Sextus Empiricus Jan 10 '23 at 19:46
  • 1
    @SextusEmpiricus but bootstrap's guarantees are asymptotic; I thought you were talking about model-based simulation of some sort. – John Madden Jan 10 '23 at 19:48
  • @JohnMadden, I am not sure what is wrong with bootstrapping, but it is just an example of how one does not necessarily need a normal distribution to generate confidence intervals. Simulations can be performed in several ways. – Sextus Empiricus Jan 10 '23 at 20:13
  • @SextusEmpiricus maybe it would help our conversation to be a little more precise, and to distinguish between confidence intervals and asymptotic confidence intervals. Bootstrapped intervals are only confidence intervals asymptotically, so far as I understand. I thought you were referring to using MC to generate exact intervals, but maybe I misunderstood. – John Madden Jan 10 '23 at 21:09
  • @JohnMadden fair enough, replace confidence interval with asymptotic confidence intervals in all my previous comments. It is the same for the use of a normal approximation to estimate a confidence interval (that won't be exact either). The point remains the same, and that is that a frequentist approach can work without a normal approximation. The normal approximation is not essential for the method (a lot of frequentist statistics is without normal distributions), and it is only a practical technique to help computations. – Sextus Empiricus Jan 10 '23 at 21:18
  • 2
    "The central limit theorem (CLT) is central to any field of statistics. That's why it is called central!" << No, it's called the "central limit" theorem because it literally says that the limit of a sum of independent variables will always be centered. – Stef Jan 11 '23 at 10:33
  • 1
    @Stef I agree that there can be many interpretations to the term 'central' but at least the inventor of the term George Pólya who used it in 1920 intended the term 'central' to refer to the importance and general application of the limit. – Sextus Empiricus Jan 11 '23 at 11:47
  • @Stef I have added additional clarification in the text/post/answer. – Sextus Empiricus Jan 11 '23 at 12:10
  • @SextusEmpiricus Oh I think I finally understand why we're talking past each other. In my answer, by "The Frequentist needs asymptotics for reason XYZ" I meant "The Frequentist can benefit from asymptotics" rather than "The Frequentist is obligated to use asymptotics for reason XYZ". – John Madden Jan 11 '23 at 13:53
  • @JohnMadden when you use the verb 'needs' then this is very different from 'benefits'. With Bayesian techniques one just as well uses estimates. The benefits of estimation are all around and not something specific for frequentist inference. – Sextus Empiricus Jan 11 '23 at 14:03
  • "Real normal distributions do not exist in nature" - neither does anything exist that guarantees that assumptions of the CLT hold. – Christian Hennig Jan 11 '23 at 14:46
  • @SextusEmpiricus Wow, awesome. Thanks for that quote. – Stef Jan 11 '23 at 14:51
  • @ChristianHennig Those assumptions do not need to hold in order for an approximation with a normal distribution to make sense. The point that I wanted to make with that sentence is that normal distributions do not occur in nature because something is intrinsically normally distributed. Instead, nearly every time that we can describe something with a normal distribution (and Bayesian techniques use the normal distribution a lot as well) it is because of the mechanism behind the CLT (whether the assumptions are exactly true or not are irrelevant details for that point). – Sextus Empiricus Jan 11 '23 at 15:40
  • @SextusEmpiricus If the assumptions of the CLT don't hold, what is the relation of the CLT to the fact that many things in nature can be approximated well by a normal? Also, what does "we can describe something by a normal"/"normal approximation makes sense" actually mean? Obviously we can also describe something pretty badly. (Note that I'm playing devil's advocate here - chances are that in practice there may be not much disagreement between us, however I want to stress that connecting reality to mathematical theory is by no means simple.) – Christian Hennig Jan 11 '23 at 16:06
  • One could claim that "assumptions hold approximately", but it is very hard if not impossible to nail that claim down precisely, and what the evidence for it is. (That many things in nature are approximately normal can not be used as evidence, because the claim is about distinguishing reasons why this is so.) – Christian Hennig Jan 11 '23 at 16:10
  • 1
    @ChristianHennig I agree that the CLT does not hold exactly in nature (already infinity is never obtained, and the idea of the limit is an ideal theoretical concept) and nothing is truly normally distributed. But that is beyond my point. The idea is that normal approximations work and are ubiquitous wherever sums of variables arise. As a consequence, whenever a Bayesian technique uses a normal distribution then in a way it is using the CLT or the idea behind it. – Sextus Empiricus Jan 11 '23 at 16:18
  • The normal distribution is a way to simplify more complex distributions in nature. It is not a thing that naturally exists by itself, nor something specifically required by frequentist statistics that does not occur in Bayesian statistics. – Sextus Empiricus Jan 11 '23 at 16:25
  • The normal distribution has a number of properties apart from the CLT that would also justify its use in some situations, for example maximum entropy/minimum information for given variance, multivariate dependence structure characterised by up to second moments, mean being its ML estimator (this was how Gauss actually derived the normal). – Christian Hennig Jan 11 '23 at 17:14
  • 1
    @ChristianHennig being a maximum entropy distribution is related to being a sum of many little variables. You can see a maximum entropy distribution as arising from many random operations that preserve some parameters (like, in the case of the normal distribution, the mean and variance). – Sextus Empiricus Jan 11 '23 at 17:31
  • 1
    There are many different types of distributions. Often there is some mechanism behind a distribution. For instance a Bernoulli distribution relates to coin flips. In the case of a Normal distribution, when it (approximately) occurs in nature it is always because of some mechanism that involves a summation of many little variables. There's no other process that generates a Normal distribution. So, whenever somebody uses a Normal distribution (and this is far from being exclusive to frequentist methods) then in a way one applies the CLT or normal approximation. – Sextus Empiricus Jan 11 '23 at 17:37
  • 1
    Or maybe I should state it like this: a Normal distribution does not occur as an atomic distribution. There is nothing inherently Normally distributed, and when a Normal distribution 'occurs' it is always due to some process that sums several smaller variables (e.g. a Galton board, where a ball hits multiple pins before ending up in a bin). – Sextus Empiricus Jan 11 '23 at 17:39
  • @SextusEmpiricus: Insightful point about the Normal distribution never occurring as atomic; I hadn't realized that, and I can't think of a counterexample; the Normal distribution and the CLT seem to always go together. – ColorStatistics Jan 12 '23 at 01:13
4

Reproduced verbatim from the Wikipedia page:

In Bayesian inference, the Bernstein-von Mises theorem provides the basis for using Bayesian credible sets for confidence statements in parametric models. It states that under some conditions, a posterior distribution converges in the limit of infinite data to a multivariate normal distribution centered at the maximum likelihood estimator with covariance matrix given by $n^{-1}I(\theta_0)^{-1}$, where $\theta_0$ is the true population parameter and $I(\theta_0)$ is the Fisher information matrix at the true population parameter value.

The Bernstein-von Mises theorem links Bayesian inference with frequentist inference. It assumes there is some true probabilistic process that generates the observations, as in frequentism, and then studies the quality of Bayesian methods of recovering that process, and making uncertainty statements about that process. In particular, it states that Bayesian credible sets of a certain credibility level $\alpha$ will asymptotically be confidence sets of confidence level $\alpha$, which allows for the interpretation of Bayesian credible sets.

With the reference

van der Vaart, A.W. (1998). "10.2 Bernstein–von Mises Theorem". Asymptotic Statistics. Cambridge University Press. ISBN 0-521-49603-9.
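
To see the Bernstein-von Mises phenomenon numerically, here is a small sketch in a Bernoulli model; the Beta(2, 2) prior and the true parameter value are arbitrary choices:

```python
import numpy as np
from scipy import stats

# Bernoulli model with a Beta(2, 2) prior; theta0 is the (illustrative)
# true parameter, with Fisher information I = 1 / (theta0 * (1 - theta0))
rng = np.random.default_rng(0)
theta0 = 0.3

for n in (10, 100, 10_000):
    x = rng.binomial(1, theta0, size=n)
    mle = x.mean()
    posterior = stats.beta(2 + x.sum(), 2 + n - x.sum())
    # Bernstein-von Mises limit: normal centered at the MLE with
    # covariance (n * I(theta0))^{-1} = theta0 * (1 - theta0) / n
    bvm = stats.norm(mle, np.sqrt(theta0 * (1 - theta0) / n))
    grid = np.linspace(0.01, 0.99, 99)
    gap = np.max(np.abs(posterior.cdf(grid) - bvm.cdf(grid)))
    print(f"n = {n:6d}  max CDF distance posterior vs normal: {gap:.4f}")
```

As $n$ grows, the gap between the posterior and the normal approximation shrinks, which is the sense in which credible sets become asymptotic confidence sets.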

Xi'an
  • 3
    Thank you for putting the Bernstein-von Mises theorem on the map for me. My guess is that in BDA Andrew Gelman had this theorem in mind when he referred to approximating the posterior with a multivariate normal distribution. I find it quite fascinating. – ColorStatistics Jan 11 '23 at 11:15