
There is already a post on this site talking about the same issue: Why does shrinkage work?

But even though the answers there are popular, I don't believe the gist of the question is really addressed. It is clear that introducing some bias into an estimator reduces its variance and may improve estimation quality. However:

1) Why is the damage done by introducing bias smaller than the gain from reducing variance?

2) Why does it always work? For example, in the case of ridge regression there is an existence theorem: there is always some positive penalty for which the ridge estimator has lower mean squared error than OLS. (A small simulation illustrating this is sketched after the list.)

3) What's so interesting about 0 (the origin)? Clearly we can shrink towards any point we like (e.g. a Stein-type estimator), but will it work as well as shrinking towards the origin? (See the second sketch below.)

4) Why do various universal coding schemes use fewer bits around the origin? Are these hypotheses simply more probable?
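
To make questions 1 and 2 concrete, here is a minimal simulation sketch (just an illustration, not one of the references I'm asking for). It compares the coefficient MSE of OLS and ridge regression on a correlated design; the design, true coefficients, noise level, and penalty grid are all arbitrary choices.

```python
import numpy as np

# Minimal sketch: compare MSE of OLS vs ridge coefficient estimates in a
# correlated-design simulation. All constants below are arbitrary choices.
rng = np.random.default_rng(0)
n, p = 50, 10
beta = rng.normal(size=p)                      # "true" coefficients
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def fit_ridge(X, y, lam):
    # (X'X + lam*I)^{-1} X'y ; lam = 0 gives OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0]
mse = {lam: 0.0 for lam in lambdas}
reps = 2000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=3.0, size=n)
    for lam in lambdas:
        b = fit_ridge(X, y, lam)
        mse[lam] += np.sum((b - beta) ** 2) / reps

for lam in lambdas:
    print(f"lambda={lam:5.1f}  MSE(beta_hat) = {mse[lam]:.3f}")
# Typically some lambda > 0 gives lower MSE than lambda = 0 (OLS),
# consistent with the existence theorem for ridge regression.
```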

Answers with references to proven theorems or established results are expected.
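
And to make question 3 concrete, here is a second sketch (again only an illustration, with arbitrary dimensions, targets, and true means): it compares positive-part James-Stein shrinkage towards the origin with shrinkage towards another fixed point, under a standard normal observation model.

```python
import numpy as np

# Sketch for question 3: James-Stein shrinkage toward the origin versus
# toward another fixed point c. Both improve on the raw estimate, but the
# gain is largest when the true mean is close to the shrinkage target.
rng = np.random.default_rng(1)
p, reps = 10, 20000

def james_stein(x, target):
    # Shrink x toward `target`; positive-part variant.
    d = x - target
    factor = max(0.0, 1.0 - (d.size - 2) / np.sum(d ** 2))
    return target + factor * d

def risk(theta, target):
    # Monte Carlo estimate of E||JS(X) - theta||^2 with X ~ N(theta, I).
    total = 0.0
    for _ in range(reps):
        x = theta + rng.normal(size=p)
        total += np.sum((james_stein(x, target) - theta) ** 2)
    return total / reps

origin = np.zeros(p)
c = np.full(p, 5.0)                      # arbitrary alternative target
for theta in (np.full(p, 0.5), np.full(p, 5.5)):
    print(f"true mean ~ {theta[0]}: "
          f"risk(shrink to 0) = {risk(theta, origin):.2f}, "
          f"risk(shrink to c) = {risk(theta, c):.2f}, "
          f"risk(no shrinkage) = {p}")
```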

  • @KarolisKoncevičius, thanks for fixing the links! Let me note, however, that your language edits might not be very helpful, except for the last one. The other ones seem to add some redundant text and thus make the post slightly less readable. – Richard Hardy May 23 '19 at 13:10
  • "what's so interesting about the origin?" how do you understand this statement?. if you have a group factor (eg country) and individual factor (eg city), then shrinkage will put average to country level, and then only city level deviations with enough data will have coefficient) - ie your model is pushed to the group level (country) average (by pushing city level coefficients to zero) ... and similarly for more levels in hierarchies (and multiple hierarchies)
  • – seanv507 May 23 '19 at 13:57