6

I am reading this article about the horseshoe prior and how it is better than the lasso and ridge priors. The author makes several points that I don't understand. One of them is "The ideal prior distribution will put a probability mass on zero to reduce variance, and have fat tails to reduce bias". Another is "both double-exponential and normal distributions have thin tails, and the probability mass they put at 0 is 0". How does the normal distribution have a zero probability mass at zero? The author goes on to explain how this makes the horseshoe prior better than the LASSO and ridge priors, yet I don't understand why this is so, or how the probability mass at zero and the fat tails reduce the bias that occurs with the LASSO.

My main question is: what is the problem with the LASSO and ridge priors that the horseshoe prior solves?

Richard Hardy
  • 67,272
  • 7
    (I feel like questions based on towardsdatascience blog posts should be banned - there is no quality control.) Better off reading the original article, e.g. https://proceedings.mlr.press/v5/carvalho09a/carvalho09a.pdf – seanv507 Mar 07 '23 at 11:50

3 Answers

10

The MAP estimator can have non-zero probability mass at a point (even if the posterior distribution is always continuous)

The linked article is actually a bit misleading on this point, since even under the stipulated model all the relevant distributions are still continuous, so there is still zero probability mass at the point $\beta=0$. This is typical in penalised regression models, so it is misleading to describe things in the way the author has done. The issue here is really about the difference between properties of a posterior distribution, versus properties of a point estimator formed by taking the posterior mode (called the MAP estimator).

In regard to your first query, you should note that probability mass refers to actual probability ---not probability density--- so if a random variable has any continuous distribution then it has zero probability mass at any single point. This is true of the normal distribution, just as with other continuous distributions. The stipulated model in the linked post also uses a continuous prior distribution for the coefficient parameter in the regression.

The real issue here (which is obscured by the misleading language of the linked post) is that the estimator $\hat{\beta}$ used in penalised regression analysis can have a non-zero probability of being zero even when you use a continuous prior distribution for the true coefficient parameter. To see this, we first note that the estimator is obtained by maximising the penalised log-likelihood (the log-likelihood minus the penalty term), which is equivalent to a MAP estimator (see this related answer for how these two approaches link to each other). Under certain specifications of the penalty function (equivalently, the prior in Bayesian analysis) the MAP estimator has a non-zero probability of being equal to zero. In other words, it is possible to have:

$$\mathbb{P}(\beta=0) = 0 \quad\quad\quad \text{and yet} \quad\quad\quad \mathbb{P}(\hat{\beta} = 0) > 0.$$

This possibility may seem a bit subtle and it requires some explanation. A continuous prior leads to a continuous posterior in the regression analysis, so there is zero probability mass a posteriori at any given point in the parameter space. However, although every parameter value has zero probability mass a posteriori, the mode of the posterior (which is used as the point estimator) may be the same under a wide enough set of sampling outcomes that it has a non-zero probability of falling at a given point. This is a common occurrence in penalised regression analysis.
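
For a concrete illustration, here is a minimal simulation sketch (my own toy setup, not part of the argument above): a one-parameter normal-means model $y \sim N(\beta, 1)$ with a Laplace prior whose penalty is $\lambda = 1$, so the MAP estimator is the soft-thresholding rule. The posterior is continuous in every simulated data set, yet the MAP estimate lands exactly on zero a large fraction of the time.

```python
# Sketch: zero posterior mass at beta = 0 (continuous prior), yet the
# MAP/lasso estimator equals exactly zero with positive probability.
import numpy as np

rng = np.random.default_rng(0)
beta_true, lam, n_sims = 0.3, 1.0, 10_000

# One observation y ~ N(beta_true, 1) per simulated data set.
y = beta_true + rng.standard_normal(n_sims)

# MAP estimate under a Laplace prior (lasso penalty lam): soft-thresholding of y.
beta_hat = np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

print("P(beta = 0 | y) under the continuous posterior:", 0.0)
print("empirical P(beta_hat = 0):", np.mean(beta_hat == 0.0))
```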

Ben
  • 124,856
  • But why does the horseshoe shape matter here? I am not sure what this picture https://i.imgur.com/eSYh5f7.png is trying to explain. Is it $\beta$ or the scale parameter $\lambda_m$ that has the horseshoe shape? – user3911153 Mar 07 '23 at 04:33
  • Also, with respect to your last paragraph, why does the LASSO have a zero mode for the posterior? – user3911153 Mar 07 '23 at 04:38
  • the horseshoe is of the transformed regularisation parameter $\kappa_i = 1/(1+\lambda_i^2)$. – seanv507 Mar 07 '23 at 11:50
  • @seanv507, but why is the name horseshoe used instead of just half-Cauchy? What role does the horseshoe shape play? Also, the horseshoe is related to the shape of the scale parameter, not $\beta$, so why is that even important? – user3911153 Mar 07 '23 at 12:01
7

The idea is that you want your regularisation procedure to set small parameter estimates to zero and leave large estimates unchanged. Now, lasso does zero out small estimates (ridge doesn't even do that), but both lasso and ridge shrink large estimates towards zero, which is a significant source of bias in the two procedures.
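
As a small numerical sketch of this difference (my own toy numbers, assuming a unit-variance normal-means problem with penalty $\lambda = 1$, where the lasso estimate is soft-thresholding and the ridge estimate is proportional shrinkage):

```python
# Lasso (soft-thresholding) vs ridge (proportional shrinkage) applied to a
# small and a large least-squares estimate, with penalty lam = 1.
import numpy as np

lam = 1.0
for ols in (0.5, 10.0):
    lasso = np.sign(ols) * max(abs(ols) - lam, 0.0)  # exactly zero if |ols| <= lam
    ridge = ols / (1.0 + lam)                        # shrinks, but never to zero
    print(f"OLS = {ols:5.1f}   lasso = {lasso:5.2f}   ridge = {ridge:5.2f}")

# The lasso zeroes out the small estimate but still drags 10.0 down to 9.0 (bias);
# ridge shrinks both estimates and never produces an exact zero.
```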

For some intuition about why fat-tailed priors tend to leave large values relatively untouched, consider the ultimate fat-tailed prior, which is the improper flat prior $\pi(\beta) \propto 1$. In this case, the regression estimates are the usual least-squares estimates and so are completely unbiased.

As for the probability mass question, both the normal and Laplace/double-exponential distributions have zero probability mass at zero in the sense that $\Pr_\pi(\beta = 0) = 0$. The advantage of having non-zero prior mass at zero, $\Pr_\pi(\beta = 0) = p > 0$, is that this allows the posterior distribution of $\beta$ to have a positive probability of being zero, and so the posterior estimates of the regression coefficients are likely to have zeroed-out components. This reduces variance, since very small coefficient estimates are likely to just be fitted to noise. Again, for intuition, consider the limiting case where $p = 1$, so that $\pi$ puts all probability mass at $0$. Now the posterior estimate is always $0$, and so has zero variance.
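
To see how prior mass at zero sits between these two extremes, here is a sketch with a spike-and-slab prior (not the horseshoe, and the values of $p$, $\tau$ and $y$ are only illustrative): prior mass $p$ at zero and a $N(0, \tau^2)$ slab otherwise, with a single observation $y \sim N(\beta, 1)$. The posterior now carries genuine probability on $\beta = 0$, and that probability is large exactly when the observation is small.

```python
# Spike-and-slab prior: P(beta = 0) = p, otherwise beta ~ N(0, tau^2),
# with one observation y ~ N(beta, 1). The posterior mass at zero follows
# from comparing the two marginal likelihoods of y.
from scipy.stats import norm

p, tau = 0.5, 2.0
for y in (0.2, 3.0):
    m0 = norm.pdf(y, loc=0.0, scale=1.0)                    # marginal if beta = 0
    m1 = norm.pdf(y, loc=0.0, scale=(1.0 + tau**2) ** 0.5)  # marginal under the slab
    post_zero = p * m0 / (p * m0 + (1 - p) * m1)
    print(f"y = {y:3.1f}   P(beta = 0 | y) = {post_zero:.3f}")

# A small y gives a high posterior probability of exactly zero; a large y gives a low one.
```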

  • But why does the horseshoe shape matter here? I am not sure what this picture https://i.imgur.com/eSYh5f7.png is trying to explain. Is it $\beta$ or the scale parameter $\lambda_m$ that has the horseshoe shape? – user3911153 Mar 07 '23 at 04:32
  • Can you please explain why the horseshoe prior has nonzero probability mass at zero? – user3911153 Mar 07 '23 at 12:24
7

Probability mass at zero

How does the normal distribution have a zero probability mass at zero?

The normal distribution has a non-zero density at zero, but the probability (mass) at zero is zero: $P[X=0] = 0$.

By placing a probability mass at zero, the prior expresses more strongly the belief that a parameter is probably zero. That is helpful in a setting where many regressors are included, of which we believe most are not truly in the model.

Fatter tails

By having fatter tails we allow a few of the parameters to take on larger values more easily. In RIDGE and LASSO the penalty does not only keep out the 'unwanted' overfitting of noise due to too many parameters; it also shrinks the 'correct' model parameters. The estimated parameter values under penalization are smaller than the unbiased ordinary least squares estimates.

What's the difference and which is better?

So you can see this prior as a more extreme variant of LASSO in comparison to ridge, placing even more focus on parameter selection and less on regularising by shrinking parameters.

Note that one is not necessarily better than the other. Shrinkage is not always unwanted, and regularisation is not all about parameter selection. They just place a different focus.

The horseshoe

The name “horseshoe” comes from the shape of the distribution when we re-parametrize it with a transformation to the shrinkage weight $\kappa$.

What is this 'shrinkage weight'? It relates to the use of the following prior model:

$$\begin{array}{rcl} \beta_i &\sim& N(0,\tau \lambda_i) \\ \tau &\sim& \text{Half-Cauchy}(0,\tau_0) \\ \lambda_i &\sim& \text{Half-Cauchy}(0,1) \end{array}$$

or in reparameterized form

$$\begin{array}{rcl} \beta_i &\sim& N\left(0,\ \tau\sqrt{\kappa_i^{-1}-1}\right) \\ \tau &\sim& \text{Half-Cauchy}(0,\tau_0) \\ \kappa_i &\sim& \text{Beta}\left(\tfrac{1}{2},\tfrac{1}{2}\right) \end{array}$$

The relationship between the beta distribution and the half-Cauchy distribution (whose square is F-distributed) can be seen when we rewrite the reparameterization as $\lambda^2 = \frac{1-\kappa}{\kappa}$, which resembles the transformation between the F-distribution and the beta distribution described in several places (e.g. on Wikipedia).

This $\kappa_i$ determines the scale of the prior $N\left(0,\ \tau\sqrt{\kappa_i^{-1}-1}\right)$ and makes it either a point mass when $\kappa_i = 1$ or a very heavy-tailed (diffuse) distribution as $\kappa_i \to 0$.
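
A quick numerical check of this reparameterization (a sketch; $\tau$ plays no role in the transformation itself): sample $\lambda_i$ from a standard half-Cauchy, transform to $\kappa_i = 1/(1+\lambda_i^2)$, and compare with $\text{Beta}\left(\tfrac{1}{2},\tfrac{1}{2}\right)$. The U (“horseshoe”) shape of this density puts most of its mass near $\kappa = 0$ (essentially no shrinkage) and near $\kappa = 1$ (essentially complete shrinkage).

```python
# Sample lambda ~ half-Cauchy(0, 1), transform to the shrinkage weight
# kappa = 1 / (1 + lambda^2), and compare its quantiles with Beta(1/2, 1/2).
import numpy as np
from scipy.stats import beta, cauchy

rng = np.random.default_rng(1)
lam = np.abs(cauchy.rvs(size=100_000, random_state=rng))  # half-Cauchy(0, 1) draws
kappa = 1.0 / (1.0 + lam**2)

qs = [0.1, 0.25, 0.5, 0.75, 0.9]
print("empirical kappa quantiles:", np.round(np.quantile(kappa, qs), 3))
print("Beta(1/2, 1/2) quantiles: ", np.round(beta.ppf(qs, 0.5, 0.5), 3))
```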

User1865345
  • 8,202
  • But why does the horseshoe shape matter here? I am not sure what this picture https://i.imgur.com/eSYh5f7.png is trying to explain. Is it $\beta$ or the scale parameter $\lambda_m$ that has the horseshoe shape? – user3911153 Mar 07 '23 at 04:33
  • @user3911153 it is $\kappa_m = 1/(1+\lambda_m^2)$ that has the horseshoe shape. – Sextus Empiricus Mar 07 '23 at 09:21
  • This is what I don't understand! When you say "makes it either a point mass when $\kappa = 0$ or a heavy tailed distribution when $\kappa = 1$" How does $\kappa$ (or rather $\lambda$) decide what value to have? i.e. how does it know whether the value of $\beta$ is zero or not so that it decides to have a point mass or a heavy tailed distribution? – user3911153 Mar 07 '23 at 12:20
  • @user3911153 you can see $\beta_i \sim N\left(0,\tau\sqrt{\kappa_m^{-1}-1}\right)$ as a mixture distribution of multiple Gaussian distributions with different variances that ranges from zero variance (which relates to a point mass delta function) to infinite variance (which relates to an improper uniform prior distribution). The prior on $\kappa$ regulates how much we concentrate on the different components in the mixture. The horseshoe distribution for $\kappa$ means that we are mixing the extremes on this scale. – Sextus Empiricus Mar 07 '23 at 13:22
  • btw I erroneously switched the role of $\kappa =1$ and $\kappa =0$ in my previous answer. All these multiple division terms can be confusing. – Sextus Empiricus Mar 07 '23 at 13:37
  • You say "the prior on $\kappa$ regulates how much we concentrate on the different components in the mixture". But how does this work? I mean how does $\kappa$ know which $\beta_i$ needs less or more variance since the prior on $\kappa$ doesn't depend on $\beta$? – user3911153 Mar 07 '23 at 15:18
  • 1
    @user3911153 the prior doesn't know this. It is just an expression of our prior information/knowledge. That information is expressed as a mixture: either $\beta_i$ plays no role at all, or $\beta_i$ plays a nearly full role. When we add data, the information will change. (It suddenly sounds like Schrödinger's cat to me) – Sextus Empiricus Mar 07 '23 at 15:55
  • So is it just that the prior for $\lambda$ is chosen in such a way that the marginal distribution for $\beta$ has the desired properties we want (i.e. the probability mass at zero and the fat tails)? – user3911153 Mar 07 '23 at 18:36