
Why is the exponential family so important in statistics?

I was recently reading about the exponential family in statistics. As far as I understand, the exponential family refers to any probability distribution whose density (or mass) function can be written in the following form (notice the exponential in this equation):

$$ f(x \mid \theta) = h(x)\, \exp\!\big( \eta(\theta)^\top T(x) - A(\theta) \big) $$
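
Here $h(x)$ is a base measure, $\eta(\theta)$ the natural parameter, $T(x)$ the sufficient statistic, and $A(\theta)$ the log-normalizer. As a concrete illustration (my own, just to make the notation concrete), the Poisson distribution with mean $\lambda$ fits this template:

$$ P(x \mid \lambda) = \frac{\lambda^x e^{-\lambda}}{x!} = \frac{1}{x!} \exp\big( x \log\lambda - \lambda \big), $$

so that $h(x) = 1/x!$, $\eta(\lambda) = \log\lambda$, $T(x) = x$, and $A(\lambda) = \lambda$.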

This includes common probability distributions such as the normal distribution, the gamma distribution, the Poisson distribution, etc. Exponential-family distributions are often used as the response distribution (together with a "link function") in regression problems (e.g., in count data settings, the response variable can be related to the covariates through a Poisson distribution); distributions belonging to the exponential family are often chosen because of their "desirable mathematical properties".
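
For instance (just to make the link-function idea concrete), a Poisson regression relates a count response $y_i$ to covariates $x_i$ through a log link:

$$ y_i \sim \text{Poisson}(\mu_i), \qquad \log \mu_i = x_i^\top \beta . $$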

For example, these properties include the following:

  1. Exponential families have sufficient statistics that can summarize arbitrary amounts of i.i.d. data using a fixed number of values.

  2. Exponential families have conjugate priors.

  3. The posterior predictive distribution of an exponential-family random variable with a conjugate prior can always be written in closed form.

  4. In the mean-field approximation used in variational Bayes, the best approximating posterior distribution of an exponential-family node with a conjugate prior is in the same family as the node.

Why are these properties so important?

A) The first property is about "sufficient statistics". A "sufficient statistic" is a statistic that, for a given model, captures all of the information the data set contains about the model parameter; no other statistic computed from the same data adds anything further.

I am having trouble understanding why this is important. In the case of logistic regression, the logit link function (the canonical link for the Bernoulli distribution, which belongs to the exponential family) is used to link the response variable with the observed covariates. What exactly are the "statistics" in this case (e.g., in a logistic regression model, do these "statistics" refer to the "mean" and "variance" of the beta coefficients of the regression model)? What are the "fixed values" in this case?

B) Exponential families have conjugate priors.

In the Bayesian setting, a prior $p(\theta)$ is called a conjugate prior for a likelihood $p(x \mid \theta)$ if the posterior distribution $p(\theta \mid x)$ is in the same family as the prior. If a prior is a conjugate prior, this means that a closed-form solution for the posterior exists and sampling-based techniques (e.g., MCMC) are not required to characterize the posterior distribution. Is this correct?
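
For example (my own illustration of the definition), with a binomial likelihood and a Beta prior the posterior is again a Beta distribution:

$$ x \mid \theta \sim \text{Binomial}(n, \theta), \quad \theta \sim \text{Beta}(\alpha, \beta) \;\Longrightarrow\; \theta \mid x \sim \text{Beta}(\alpha + x,\; \beta + n - x). $$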

C) Is the third property essentially similar to the second property?

D) I don't understand the fourth property at all. Variational Bayes is an alternative to MCMC sampling techniques: it approximates the posterior distribution with a simpler distribution, which can save computational time for high-dimensional posterior distributions with big data. Does the fourth property mean that variational Bayes with conjugate priors in the exponential family has a closed-form solution? So any Bayesian model that uses the exponential family does not require MCMC - is this correct?

stats_noob
  • we wouldn't use the distributions just because they have "attractive" properties though, right? The normal and Poisson distributions are simply ubiquitous in nature; that's the main reason why we use them. – Aksakal Nov 17 '21 at 03:08
  • I would argue they're not really ubiquitous at all; they're good approximations in some cases, but even in the physics examples that are often held up as "exact" for one or the other, they're clearly neither (e.g., clicks in a Geiger counter measuring radioactive decay were traditionally held up as an exemplar of a homogeneous Poisson process, but it literally can't be the case). Of course that doesn't diminish their usefulness as models of many real-world processes, and that would be an excellent reason. – Glen_b Nov 17 '21 at 07:17
  • @stats555 Obviously reasons B-D could not be compelling to a frequentist, but GLMs are very often used in that framework. However, in large data applications - and perhaps even more so in online and distributed calculation frameworks - the benefits of (A) might be highly relevant. – Glen_b Nov 17 '21 at 07:19
  • @Glen_b clicks may not be, but the radioactive decay itself is certainly exactly Poisson. The optimal solution to Heisenberg's inequality is Gaussian, etc. – Aksakal Nov 17 '21 at 12:34
  • Your use of the word "certainly" is over-optimistic. Physical theories are refined over time (like Newton's laws of motion; they're great approximations but only that; if Newton had said "certainly" he'd be wrong); this is certain to happen again and again. We know GR and quantum mechanics don't fit together as is, for example. In any case, note that the amount of material available to decay changes with time; hence my explicit mention of "not homogeneous Poisson"; ... ctd – Glen_b Nov 17 '21 at 23:38
  • ctd... even if I had a series of Poisson counts with changing mean, using a single Poisson to describe all of them would indeed be an approximation. – Glen_b Nov 17 '21 at 23:40
  • @Glen_b the complexity of nature can cause deviations because of things like a varying mean, but over short periods of time a Poisson distribution might be a good model. The same is true for the normal distribution. From Gauss's measurements of the positions of celestial bodies to the discovery of the Higgs particle, these distributions have been used extensively as models for descriptions of nature. – Sextus Empiricus Nov 18 '21 at 06:25
  • I've been saying the model could be a good approximation the whole time, so I don't have any dispute with "good model". I object to claims of exactness only. – Glen_b Nov 18 '21 at 09:57
  • @Glen_b, ok, I was under the impression that you also had a dispute with the models being ubiquitous. Also, I would not consider the models as merely useful; there is more to it. Possibly the theory is actually true (I agree it is not certain), and it is only in practice that the outcomes are not exactly described by the ideal models, because the conditions are not as exact and 'pretty' as in theoretical examples. – Sextus Empiricus Nov 18 '21 at 12:13
  • I guess the point of @Aksakal was that some distributions from the exponential family, like the Poisson distribution and the normal distribution, are not used just because they have attractive, convenient properties, but also because they theoretically match descriptions of nature. To counter this point with a discussion about whether it is 'exact' or whether 'certainly' is over-optimistic is beside the point. ... – Sextus Empiricus Nov 18 '21 at 12:24
  • ... The fact is that the normal distribution, the Poisson distribution, and many (if not most) others were in use well before those attractive properties of the exponential family began to be studied. Those properties are an afterthought and not the reason why the exponential family is so important. – Sextus Empiricus Nov 18 '21 at 12:25
  • But maybe this becomes semantic now. What does the OP mean by important? Is the question 'why is the exponential family important' as in 'why do we use members of this family so often', or is it more 'what is so important about the exponential family' as in what is so special about it that its properties are studied and the term 'exponential family' is used so often? I guess the question is more about why the concept of exponential families is important, and not about why the 'exponential family' and its members are important. – Sextus Empiricus Nov 18 '21 at 12:29
  • Oh, I misunderstood which part we were talking about. Well, I guess I would object to the word ubiquitous as well; "very commonly used" or "widespread", perhaps, but actually ubiquitous? I wonder if we may be understanding the meaning of the word differently. I don't dispute an assertion of importance for the models, but I wouldn't claim for them properties I can't seem to justify. – Glen_b Nov 18 '21 at 22:26
  • IIUC, one point is that they are all effects of the combinatorial distribution, with sample n, on an infinite population, with various scale shifts (see Stirling's approximation of n!). Sort of a 'one more dice roll / coin flip' perception of distribution(s). – Philip Oakley Nov 19 '21 at 16:32

5 Answers


Excellent questions.

Regarding A: A sufficient statistic is nothing more than a distillation of the information that is contained in the sample with respect to a given model. As you would expect, if you have a sample $x_i \sim N(\mu,\sigma^2)$ for $i \in \{1, \ldots, N\}$, each independent, then so long as we calculate the sample mean and sample variance, it doesn't matter what the individual values $x_i$ are. In linear regression (easier to talk about than logistic in this context), the sampling distribution of the unknown coefficient vector (for known variance) is $N\big((\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y},\ \sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}\big)$, so as long as these quantities are identical, inference based upon them will be too. This is the idea of sufficiency.

Note that in the $N(\mu,\sigma^2)$ example, the sufficient statistic comprises just two numbers: $\hat{\mu}=\frac{1}{N}\sum_{i=1}^N x_i$ and $\frac{1}{N}\sum_{i=1}^N (x_i-\hat{\mu})^2$, no matter how big our sample size $N$ is (and assuming $N>2$). Likewise, the vector $(\mathbf{X}^\top\mathbf{X})^{-1}\mathbf{X}^\top\mathbf{y}$ is of dimension $P$ and $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ of dimension $P\times P$ (here $P$ is the number of columns of the design matrix), which are both independent of $N$ (though, technically, the matrix $\sigma^2(\mathbf{X}^\top\mathbf{X})^{-1}$ is just a constant under our assumptions). So in these examples, the sufficient statistic has a fixed number of values (not fixed values), or as I would put it, fixed dimension.
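
To make this concrete, here is a small sketch (illustrative only; the variable names and numbers are made up) showing that two different normal samples sharing the same sample mean and variance yield identical log-likelihoods at any $(\mu, \sigma)$, so they lead to exactly the same inference:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# One sample of size N = 5 ...
x = rng.normal(loc=2.0, scale=1.5, size=5)
# ... and a second, different sample constructed to share the first sample's
# mean and variance (reflecting around the mean preserves both).
y = 2 * x.mean() - x

# Same sufficient statistics, even though the raw samples differ
print(x.mean(), y.mean())   # identical sample means
print(x.var(), y.var())     # identical sample variances

# Identical log-likelihoods at any parameter values
for mu, sigma in [(0.0, 1.0), (2.0, 1.5), (-1.0, 3.0)]:
    ll_x = stats.norm(mu, sigma).logpdf(x).sum()
    ll_y = stats.norm(mu, sigma).logpdf(y).sum()
    print(np.isclose(ll_x, ll_y))  # True: inference about (mu, sigma) is unchanged
```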

Let's note three more things. First, there is no such thing as the sufficient statistic for a distribution; rather, there are many possible statistics which may be sufficient, and which may be of different dimension. Second, the entire sample itself, since it trivially contains all of the information in the sample, is always a sufficient statistic. This is a trivial case, but an important one, as in general one cannot always expect to find a sufficient statistic of dimension less than $N$. Finally, note the model specificity: that's why I wrote "with respect to a given model" above. Changing your likelihood will change the sufficient statistics, at least potentially, for a given dataset.

Regarding B: What you're saying is correct, but in addition to allowing analytic posteriors in the univariate case, conjugacy has serious benefits in the context of Bayesian hierarchical models estimated via MCMC. This is because the conditional posteriors are also available in closed form, so we can actually accelerate Metropolis-within-Gibbs style MCMC algorithms with conjugacy.
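
As a rough sketch of what that buys you (my own toy example; `draw_mu`, the prior values, and the data are all made up), here is a closed-form conditional draw for a normal mean with known variance, the kind of update that would replace a Metropolis step inside a Gibbs sampler:

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative data: y_i ~ N(mu, sigma2), with sigma2 treated as known here
y = rng.normal(loc=3.0, scale=2.0, size=50)
sigma2 = 4.0                 # known observation variance (assumption for this sketch)
mu0, tau2 = 0.0, 10.0        # conjugate N(mu0, tau2) prior on mu

def draw_mu(y, sigma2, mu0, tau2, rng):
    """Exact draw from the conditional posterior p(mu | y, sigma2).

    Because the normal prior is conjugate to the normal likelihood,
    this conditional is itself normal and needs no Metropolis proposal."""
    n = len(y)
    post_var = 1.0 / (n / sigma2 + 1.0 / tau2)
    post_mean = post_var * (y.sum() / sigma2 + mu0 / tau2)
    return rng.normal(post_mean, np.sqrt(post_var))

# In a full Metropolis-within-Gibbs sampler this draw would be one block;
# non-conjugate blocks would still need Metropolis proposals.
samples = np.array([draw_mu(y, sigma2, mu0, tau2, rng) for _ in range(2000)])
print(samples.mean(), samples.std())  # near the closed-form posterior mean and sd
```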

Regarding C: It's definitely a similar idea, but I do want to make clear that we're talking about two different distributions here: the "posterior" versus the "posterior predictive". As the names imply, both are posterior distributions, which means they are distributions of an unknown quantity conditioned on our known data. A "posterior" plain and simple usually refers to something like $P(\mu, \sigma^2 \mid \{x_1, \ldots, x_N\})$ from our normal example above: a distribution over the unknown parameters appearing in the data-generating distribution. In contrast, a "posterior predictive" gives the distribution of a hypothetical $(N+1)$st data point $x_{N+1}$ conditional on the observed data: $P(x_{N+1} \mid \{x_1, \ldots, x_N\})$. Notice that this is not conditional on the parameters $\mu$ and $\sigma^2$: they had to be integrated out. It is this additional integral that conjugacy makes available in closed form.
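
To spell out the extra step, the posterior predictive averages the likelihood of the new point over the posterior of the parameters:

$$ P(x_{N+1} \mid x_1, \ldots, x_N) = \int P(x_{N+1} \mid \mu, \sigma^2)\, P(\mu, \sigma^2 \mid x_1, \ldots, x_N)\, d\mu\, d\sigma^2 . $$

Conjugacy is what lets this integral be evaluated analytically (in the normal example with the standard conjugate prior it yields a Student-$t$ distribution).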

Regarding D: In the context of Variational Bayes (VB), you have some posterior distribution $P(\theta|X)$, where $\theta$ is a vector of $P$ parameters and $X$ are some data. Rather than trying to generate a sample from it, as MCMC does, we instead use an approximate posterior distribution that's easy to work with and, hopefully, close to the true one. That's called the variational distribution and is denoted $Q_\eta(\theta)$. Notice that the variational distribution is indexed by variational parameters $\eta$. Variational parameters are nothing like the parameters we do Bayesian inference on, and nothing like our data: they don't have a distribution associated with them and they don't play some hypothetical role in generating the data. Rather, they are chosen by an iterative optimization algorithm. The whole idea of variational inference is to define some measure of dissimilarity between the variational distribution and the true posterior and then minimize that measure with respect to the parameters $\eta$. We'll denote the result of that optimization by $\hat{\eta}(X)$. At that point, hopefully $Q_{\hat{\eta}(X)}(\theta)$ is pretty close to $P(\theta|X)$, and if we do inference using $Q_{\hat{\eta}(X)}(\theta)$ instead we'll get similar answers.

Now where does conjugacy fit in? A popular measure of dissimilarity is the reverse KL divergence, which leads to the following objective:

$$ \hat{\eta}(X) := \underset{\eta}{\textrm{argmin}}\, \mathbb{E}_{\theta\sim Q_\eta}\bigg[\log \frac{Q_{\eta}(\theta)}{P(\theta\mid X)}\bigg] $$

This integral cannot be solved in terms of simple functions in general. However, it is available in closed form when:

  1. We use a conjugate prior to define $P(\theta|X)$.

  2. We assume that the variational distribution factorizes over the parameters (the mean-field assumption), in other words that $Q_\eta(\theta)=\prod_{j=1}^P q_{j,\eta_j}(\theta_j)$.

  3. We further restrict ourselves to a particular $q_{j,\eta_j}$ for each $j$ (which is determined by the likelihood).

So it's not that the variational posterior is available in closed form. Rather, it's that the cost function which defines the variational posterior is available in closed form. The cost function being closed form makes computing the variational distribution an easier optimization problem, since we can analytically compute function values and gradients.
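
As a sketch of why those three ingredients give closed-form updates (the standard mean-field result, stated here without derivation), the optimal $j$th factor satisfies

$$ \log q_j^*(\theta_j) = \mathbb{E}_{\theta_{-j} \sim \prod_{k \neq j} q_k}\big[ \log P(\theta, X) \big] + \text{const}, $$

and with an exponential-family likelihood and conjugate prior this expectation involves only known expected sufficient statistics, so $q_j^*$ lands back in a recognizable exponential-family form and each coordinate update of the optimization can be written down analytically.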

John Madden
  • @John Madden: Thank you so much for your answer! I was not expecting such a detailed answer! I will have to re-read it several times to fully understand it! As of now, I have the following questions based on your answer: – stats_noob Nov 17 '21 at 05:02
  • Property 1 states that all statistics (e.g., mu from a normal distribution, beta coefficients from a regression model, etc.) based on distributions from the exponential family are always sufficient statistics. Does this mean that ALL statistics based on distributions that are not from the exponential family are NOT sufficient? Or can you still have sufficient statistics based on distributions not belonging to the exponential family? (e.g., the uniform distribution: https://math.stackexchange.com/questions/572867/non-exponential-family-probability-distributions-and-their-uses) – stats_noob Nov 17 '21 at 05:07
  • This may sound like a trivial question - but is there any reason that non-sufficient statistics are considered so important? In the event that a statistic is non-sufficient, what exactly are we losing? For example, do non-sufficient statistics have very high variance? I am a bit confused: could it be that non-sufficient statistics might still be "good", but sufficient statistics are far "better"? For example, I heard there are some cases where it might be beneficial to use a biased estimator. Could there be any cases where it is beneficial to use a non-sufficient statistic? – stats_noob Nov 17 '21 at 05:10
  • Property 2 states that if the prior and the posterior distribution are both from the exponential family and are conjugate, then closed-form solutions for the posterior distribution exist. This means that we are not required to use MCMC sampling. However, I have seen countless examples online of the posterior distributions of beta coefficients from a simple Bayesian linear regression model being computed using MCMC. Is this done as a learning example? Is this unnecessary, since Property 2 guarantees the existence of closed-form solutions? Or is this just an exercise to familiarize beginners? – stats_noob Nov 17 '21 at 05:14