
Why is the exponential family so important in statistics?

I was recently reading about the exponential family in statistics. As far as I understand, the exponential family refers to any probability distribution whose density (or mass) function can be written in the following form (notice the exponential in this equation):

$$ f(x \mid \theta) = h(x)\, \exp\!\big( \eta(\theta)^\top T(x) - A(\theta) \big) $$
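
Here $h(x)$ is a base measure, $\eta(\theta)$ the natural parameter, $T(x)$ the sufficient statistic, and $A(\theta)$ the log-normalizer. As a concrete illustration (my own, just to make the notation concrete), the Poisson distribution with mean $\lambda$ fits this template:

$$ P(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\big( x \log\lambda - \lambda \big), $$

so that $h(x) = 1/x!$, $\eta(\lambda) = \log\lambda$, $T(x) = x$, and $A(\lambda) = \lambda$.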

This includes common probability distributions such as the normal distribution, the gamma distribution, the Poisson distribution, etc. Exponential-family distributions are often used as the response distribution (together with a "link function") in regression problems (e.g., in count data settings, the response variable can be related to the covariates through a Poisson distribution); distributions belonging to the exponential family are often chosen because of their "desirable mathematical properties".
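
For instance (just to make the link-function idea concrete), a Poisson regression relates a count response $y_i$ to covariates $x_i$ through a log link:

$$ y_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = x_i^\top \beta . $$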

For example, these properties include the following:

  1. Exponential families have sufficient statistics that can summarize arbitrary amounts of i.i.d. data using a fixed number of values.

  2. Exponential families have conjugate priors.

  3. The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.

  4. In the mean-field approximation used in variational Bayes, the best approximating posterior distribution of an exponential-family node with a conjugate prior is in the same family as the node.

Why are these properties so important?

A) The first property is about "sufficient statistics". A "sufficient statistic" is a statistic that, for a given model, captures all of the information the data set contains about the model parameter; no other statistic computed from the same data adds anything further.

I am having trouble understanding why this is important. In the case of logistic regression, the logit link function (the canonical link for the Bernoulli distribution, which belongs to the exponential family) is used to link the response variable with the observed covariates. What exactly are the "statistics" in this case (e.g., in a logistic regression model, do these "statistics" refer to the "mean" and "variance" of the beta coefficients of the regression model)? What are the "fixed values" in this case?

B) Exponential families have conjugate priors.

In the Bayesian setting, a prior $p(\theta)$ is called a conjugate prior for a likelihood $p(x \mid \theta)$ if the posterior distribution $p(\theta \mid x)$ is in the same family as the prior. If a prior is a conjugate prior, this means that a closed-form solution for the posterior exists and sampling-based techniques (e.g., MCMC) are not required to characterize the posterior distribution. Is this correct?
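
For example (my own illustration of the definition), with a binomial likelihood and a Beta prior the posterior is again a Beta distribution:

$$ x \mid \theta \sim \text{Binomial}(n, \theta), \quad \theta \sim \text{Beta}(\alpha, \beta) \;\Longrightarrow\; \theta \mid x \sim \text{Beta}(\alpha + x,\; \beta + n - x). $$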

C) Is the third property essentially similar to the second property?

D) I don't understand the fourth property at all. Variational Bayes is an alternative to MCMC sampling techniques: it approximates the posterior distribution with a simpler distribution, which can save computational time for high-dimensional posterior distributions with big data. Does the fourth property mean that variational Bayes with conjugate priors in the exponential family has a closed-form solution? So any Bayesian model that uses the exponential family does not require MCMC - is this correct?

stats_noob
  • we wouldn't use the distributions just because they have "attractive" properties though, right? The normal and Poisson distributions are simply ubiquitous in nature; that's the main reason why we use them. – Aksakal Nov 17 '21 at 03:08
  • I would argue they're not really ubiquitous at all; they're good approximations in some cases, but even in the physics examples that are often held up as "exact" for one or the other, they're clearly neither (e.g., clicks in a Geiger counter measuring radioactive decay were traditionally held up as an exemplar of a homogeneous Poisson process, but it literally can't be the case). Of course that doesn't diminish their usefulness as models of many real-world processes, and that would be an excellent reason. – Glen_b Nov 17 '21 at 07:17
  • @stats555 Obviously reasons B-D could not be compelling to a frequentist, but GLMs are very often used in that framework. However, in large data applications - and perhaps even more so in online and distributed calculation frameworks - the benefits of (A) might be highly relevant. – Glen_b Nov 17 '21 at 07:19
  • @Glen_b clicks may not be, but the radioactive decay itself is certainly exactly Poisson. The optimal solution to Heisenberg's inequality is Gaussian, etc. – Aksakal Nov 17 '21 at 12:34
  • Your use of the word "certainly" is over-optimistic. Physical theories are refined over time (like Newton's laws of motion; they're great approximations but only that; if Newton had said "certainly" he'd be wrong); this is certain to happen again and again. We know GR and quantum mechanics don't fit together as is, for example. In any case, note that the amount of material available to decay changes with time; hence my explicit mention of "not homogeneous Poisson"; ... ctd – Glen_b Nov 17 '21 at 23:38
  • ctd... even if I had a series of Poisson counts with changing mean, using a single Poisson to describe all of them would indeed be an approximation. – Glen_b Nov 17 '21 at 23:40
  • @Glen_b the complexity of nature can cause deviations because of things like a varying mean, but over short periods of time a Poisson distribution might be a good model. The same is true for the normal distribution. From Gauss's measurements of the positions of celestial bodies to the discovery of the Higgs particle, these distributions have been used extensively as models for descriptions of nature. – Sextus Empiricus Nov 18 '21 at 06:25
  • I've been saying the model could be a good approximation the whole time, so I don't have any dispute with "good model". I object to claims of exactness only. – Glen_b Nov 18 '21 at 09:57
  • @Glen_b, ok, I was under the impression that you also had a dispute with the models being ubiquitous. Also, I would not consider the models as merely useful; there is more to it. Possibly the theory is actually true (I agree it is not certain), and it is only in practice that the outcomes are not exactly described by the ideal models, because the conditions are not as exact and 'pretty' as in theoretical examples. – Sextus Empiricus Nov 18 '21 at 12:13
  • I guess the point of @Aksakal was that some distributions from the exponential family, like the Poisson distribution and the normal distribution, are not used just because they have attractive, convenient properties, but also because they theoretically match descriptions of nature. To counter this point with a discussion about whether it is 'exact' or whether 'certainly' is over-optimistic is beside the point. ... – Sextus Empiricus Nov 18 '21 at 12:24
  • ... The fact is that the normal distribution, the Poisson distribution, and many (if not most) others were in use well before those attractive properties of the exponential family began to be studied. Those properties are an afterthought and not the reason why the exponential family is so important. – Sextus Empiricus Nov 18 '21 at 12:25
  • But maybe this becomes semantic now. What does the OP mean by important? Is the question 'why is the exponential family important' as in 'why do we use members of this family so often', or is it more 'what is so important about the exponential family' as in what is so special about it that its properties are studied and the term 'exponential family' is used so often? I guess the question is more about why the concept of exponential families is important, and not about why the 'exponential family' and its members are important. – Sextus Empiricus Nov 18 '21 at 12:29
  • Oh, I misunderstood which part we were talking about. Well, I guess I would object to the word ubiquitous as well; "very commonly used" or "widespread", perhaps, but actually ubiquitous? I wonder if we may be understanding the meaning of the word differently. I don't dispute an assertion of importance for the models, but I wouldn't claim for them properties I can't seem to justify. – Glen_b Nov 18 '21 at 22:26
  • IIUC, one point is that they are all effects of the combinatorial distribution, with sample n, on an infinite population, with various scale shifts (see Stirling's approximation of n!). Sort of a 'one more dice roll / coin flip' perception of distribution(s). – Philip Oakley Nov 19 '21 at 16:32

5 Answers


Excellent questions.

Regarding A: A sufficient statistic is nothing more than a distillation of the information that is contained in the sample with respect to a given model. As you would expect, if you have a sample $x_i \sim N(\mu,\sigma^2)$ for $i \in \{1, \ldots, N\}$, each independent, then so long as we calculate the sample mean and sample variance, it doesn't matter what the individual values $x_i$ are. In linear regression (easier to talk about than logistic in this context), the sampling distribution of the unknown coefficient vector (for known variance) is $N\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},\ \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\big)$, so as long as these quantities are identical, inference based upon them will be too. This is the idea of sufficiency.

Note that in the $N(\mu,\sigma^2)$ example, the sufficient statistic comprises just two numbers: $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$ and $\frac{1}{N}\sum_{i=1}^N (x_i-\hat{\mu})^2$, no matter how big our sample size $N$ is (and assuming $N>2$). Likewise, the vector $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is of dimension $P$ and $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ of dimension $P\times P$ (here $P$ is the number of columns of the design matrix), which are both independent of $N$ (though, technically, the matrix $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is just a constant under our assumptions). So in these examples, the sufficient statistic has a fixed number of values (not fixed values), or as I would put it, fixed dimension.
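
To make this concrete, here is a small sketch (illustrative only; the variable names and numbers are made up) showing that two different normal samples sharing the same sample mean and variance yield identical log-likelihoods at any $(\mu, \sigma)$, so they lead to exactly the same inference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One sample of size N = 5 ...
x = rng.normal(loc=2.0, scale=1.5, size=5)
# ... and a second, different sample constructed to share the first sample's
# mean and variance (reflecting around the mean preserves both).
y = 2 * x.mean() - x

# Same sufficient statistics, even though the raw samples differ
print(x.mean(), y.mean())   # identical sample means
print(x.var(), y.var())     # identical sample variances

# Identical log-likelihoods at any parameter values
for mu, sigma in [(0.0, 1.0), (2.0, 1.5), (-1.0, 3.0)]:
    ll_x = stats.norm(mu, sigma).logpdf(x).sum()
    ll_y = stats.norm(mu, sigma).logpdf(y).sum()
    print(np.isclose(ll_x, ll_y))  # True: inference about (mu, sigma) is unchanged
```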

Let's note three more things. First, there is no such thing as the sufficient statistic for a distribution; rather, there are many possible statistics which may be sufficient, and which may be of different dimension. Second, the entire sample itself, since it trivially contains all of the information in the sample, is always a sufficient statistic. This is a trivial case, but an important one, as in general one cannot always expect to find a sufficient statistic of dimension less than $N$. Finally, note the model specificity: that's why I wrote "with respect to a given model" above. Changing your likelihood will change the sufficient statistics, at least potentially, for a given dataset.

Regarding B: What you're saying is correct, but in addition to allowing analytic posteriors in the univariate case, conjugacy has serious benefits in the context of Bayesian hierarchical models estimated via MCMC. This is because the conditional posteriors are also available in closed form, so we can actually accelerate Metropolis-within-Gibbs style MCMC algorithms with conjugacy.
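
As a rough sketch of what that buys you (my own toy example; `draw_mu`, the prior values, and the data are all made up), here is a closed-form conditional draw for a normal mean with known variance, the kind of update that would replace a Metropolis step inside a Gibbs sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: y_i ~ N(mu, sigma2), with sigma2 treated as known here
y = rng.normal(loc=3.0, scale=2.0, size=50)
sigma2 = 4.0                 # known observation variance (assumption for this sketch)
mu0, tau2 = 0.0, 10.0        # conjugate N(mu0, tau2) prior on mu

def draw_mu(y, sigma2, mu0, tau2, rng):
    """Exact draw from the conditional posterior p(mu | y, sigma2).

    Because the normal prior is conjugate to the normal likelihood,
    this conditional is itself normal and needs no Metropolis proposal."""
    n = len(y)
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    post_mean = post_var * (y.sum() / sigma2 + mu0 / tau2)
    return rng.normal(post_mean, np.sqrt(post_var))

# In a full Metropolis-within-Gibbs sampler this draw would be one block;
# non-conjugate blocks would still need Metropolis proposals.
samples = np.array([draw_mu(y, sigma2, mu0, tau2, rng) for _ in range(2000)])
print(samples.mean(), samples.std())  # near the closed-form posterior mean and sd
```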

Regarding C: It's definitely a similar idea, but I do want to make clear that we're talking about two different distributions here: the "posterior" versus the "posterior predictive". As the names imply, both are posterior distributions, which means they are distributions of an unknown quantity conditioned on our known data. A "posterior" plain and simple usually refers to something like $P(\mu, \sigma^2 \mid \{x_1, \ldots, x_N\})$ from our normal example above: a distribution over the unknown parameters appearing in the data-generating distribution. In contrast, a "posterior predictive" gives the distribution of a hypothetical $(N+1)$st data point $x_{N+1}$ conditional on the observed data: $P(x_{N+1} \mid \{x_1, \ldots, x_N\})$. Notice that this is not conditional on the parameters $\mu$ and $\sigma^2$: they had to be integrated out. It is this additional integral that conjugacy makes available in closed form.
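
To spell out the extra step, the posterior predictive averages the likelihood of the new point over the posterior of the parameters:

$$ P(x_{N+1} \mid x_1, \ldots, x_N) = \int P(x_{N+1} \mid \mu, \sigma^2)\, P(\mu, \sigma^2 \mid x_1, \ldots, x_N)\, d\mu\, d\sigma^2 . $$

Conjugacy is what lets this integral be evaluated analytically (in the normal example with the standard conjugate prior it yields a Student-$t$ distribution).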

Regarding D: In the context of Variational Bayes (VB), you have some posterior distribution $P(\theta|X)$, where $\theta$ is a vector of $P$ parameters and $X$ are some data. Rather than trying to generate a sample from it, as MCMC does, we instead use an approximate posterior distribution that's easy to work with and, hopefully, close to the true one. That's called the variational distribution and is denoted $Q_\eta(\theta)$. Notice that the variational distribution is indexed by variational parameters $\eta$. Variational parameters are nothing like the parameters we do Bayesian inference on, and nothing like our data: they don't have a distribution associated with them and they don't play some hypothetical role in generating the data. Rather, they are chosen by an iterative optimization algorithm. The whole idea of variational inference is to define some measure of dissimilarity between the variational distribution and the true posterior and then minimize that measure with respect to the parameters $\eta$. We'll denote the result of that optimization by $\hat{\eta}(X)$. At that point, hopefully $Q_{\hat{\eta}(X)}(\theta)$ is pretty close to $P(\theta|X)$, and if we do inference using $Q_{\hat{\eta}(X)}(\theta)$ instead we'll get similar answers.

Now where does conjugacy fit in? A popular measure of dissimilarity is the reverse KL divergence, which leads to the following objective:

$$ \hat{\eta}(X) := \underset{\eta}{\textrm{argmin}}\, \mathbb{E}_{\theta\sim Q_\eta}\bigg[\log \frac{Q_{\eta}(\theta)}{P(\theta\mid X)}\bigg] $$

This integral cannot be solved in terms of simple functions in general. However, it is available in closed form when:

  1. We use a conjugate prior to define $P(\theta|X)$.

  2. We assume that the variational distribution factorizes over the parameters (the mean-field assumption), in other words that $Q_\eta(\theta)=\prod_{j=1}^P q_{j,\eta_j}(\theta_j)$.

  3. We further restrict ourselves to a particular $q_{j,\eta_j}$ for each $j$ (which is determined by the likelihood).

So it's not that the variational posterior is available in closed form. Rather, it's that the cost function which defines the variational posterior is available in closed form. The cost function being closed form makes computing the variational distribution an easier optimization problem, since we can analytically compute function values and gradients.
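
As a sketch of why those three ingredients give closed-form updates (the standard mean-field result, stated here without derivation), the optimal $j$th factor satisfies

$$ \log q_j^*(\theta_j) = \mathbb{E}_{\theta_{-j} \sim \prod_{k \neq j} q_k}\big[ \log P(\theta, X) \big] + \text{const}, $$

and with an exponential-family likelihood and conjugate prior this expectation involves only known expected sufficient statistics, so $q_j^*$ lands back in a recognizable exponential-family form and each coordinate update of the optimization can be written down analytically.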

John Madden
  • @John Madden: Thank you so much for your answer! I was not expecting such a detailed answer! I will have to re-read it several times to fully understand it! As of now, I have the following questions based on your answer: – stats_noob Nov 17 '21 at 05:02
  • Property 1 states that all statistics (e.g., mu from a normal distribution, beta coefficients from a regression model, etc.) based on distributions from the exponential family are always sufficient statistics. Does this mean that ALL statistics based on distributions that are not from the exponential family are NOT sufficient? Or can you still have sufficient statistics based on distributions not belonging to the exponential family? (e.g., the uniform distribution: https://math.stackexchange.com/questions/572867/non-exponential-family-probability-distributions-and-their-uses) – stats_noob Nov 17 '21 at 05:07
  • This may sound like a trivial question - but is there any reason that non-sufficient statistics are considered so important? In the event that a statistic is non-sufficient, what exactly are we losing? For example, do non-sufficient statistics have very high variance? I am a bit confused: could it be that non-sufficient statistics might still be "good", but sufficient statistics are far "better"? For example, I heard there are some cases where it might be beneficial to use a biased estimator. Could there be any cases where it is beneficial to use a non-sufficient statistic? – stats_noob Nov 17 '21 at 05:10
  • Property 2 states that if the prior and the posterior distribution are both from the exponential family and are conjugate, then closed-form solutions for the posterior distribution exist. This means that we are not required to use MCMC sampling. However, I have seen countless examples online of the posterior distributions of beta coefficients from a simple Bayesian linear regression model being computed using MCMC. Is this done as a learning example? Is this unnecessary, since Property 2 guarantees the existence of closed-form solutions? Or is this just an exercise to familiarize beginners? – stats_noob Nov 17 '21 at 05:14