
There is already a post on this site talking about the same issue: Why does shrinkage work?

But even though the answers there are popular, I don't believe the gist of the question is really addressed. It is clear that introducing some bias into an estimator reduces its variance and may improve estimation quality. However:

1) Why is the damage done by introducing bias smaller than the gain from reducing variance?

2) Why does it always work? For example, in the case of ridge regression there is an existence theorem: there is always some positive penalty for which the ridge estimator has lower mean squared error than OLS. (A small simulation illustrating this is sketched after the list.)

3) What's so interesting about 0 (the origin)? Clearly we can shrink towards any point we like (e.g. a Stein-type estimator), but will it work as well as shrinking towards the origin? (See the second sketch below.)

4) Why do various universal coding schemes use fewer bits around the origin? Are these hypotheses simply more probable?
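
To make questions 1 and 2 concrete, here is a minimal simulation sketch (just an illustration, not one of the references I'm asking for). It compares the coefficient MSE of OLS and ridge regression on a correlated design; the design, true coefficients, noise level, and penalty grid are all arbitrary choices.

```python
import numpy as np

# Minimal sketch: compare MSE of OLS vs ridge coefficient estimates in a
# correlated-design simulation. All constants below are arbitrary choices.
rng = np.random.default_rng(0)
n, p = 50, 10
beta = rng.normal(size=p)                      # "true" coefficients
Sigma = 0.9 ** np.abs(np.subtract.outer(np.arange(p), np.arange(p)))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

def fit_ridge(X, y, lam):
    # (X'X + lam*I)^{-1} X'y ; lam = 0 gives OLS
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

lambdas = [0.0, 0.1, 1.0, 10.0]
mse = {lam: 0.0 for lam in lambdas}
reps = 2000
for _ in range(reps):
    y = X @ beta + rng.normal(scale=3.0, size=n)
    for lam in lambdas:
        b = fit_ridge(X, y, lam)
        mse[lam] += np.sum((b - beta) ** 2) / reps

for lam in lambdas:
    print(f"lambda={lam:5.1f}  MSE(beta_hat) = {mse[lam]:.3f}")
# Typically some lambda > 0 gives lower MSE than lambda = 0 (OLS),
# consistent with the existence theorem for ridge regression.
```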

Answers with references to proven theorems or established results are expected.
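
And to make question 3 concrete, here is a second sketch (again only an illustration, with arbitrary dimensions, targets, and true means): it compares positive-part James-Stein shrinkage towards the origin with shrinkage towards another fixed point, under a standard normal observation model.

```python
import numpy as np

# Sketch for question 3: James-Stein shrinkage toward the origin versus
# toward another fixed point c. Both improve on the raw estimate, but the
# gain is largest when the true mean is close to the shrinkage target.
rng = np.random.default_rng(1)
p, reps = 10, 20000

def james_stein(x, target):
    # Shrink x toward `target`; positive-part variant.
    d = x - target
    factor = max(0.0, 1.0 - (d.size - 2) / np.sum(d ** 2))
    return target + factor * d

def risk(theta, target):
    # Monte Carlo estimate of E||JS(X) - theta||^2 with X ~ N(theta, I).
    total = 0.0
    for _ in range(reps):
        x = theta + rng.normal(size=p)
        total += np.sum((james_stein(x, target) - theta) ** 2)
    return total / reps

origin = np.zeros(p)
c = np.full(p, 5.0)                      # arbitrary alternative target
for theta in (np.full(p, 0.5), np.full(p, 5.5)):
    print(f"true mean ~ {theta[0]}: "
          f"risk(shrink to 0) = {risk(theta, origin):.2f}, "
          f"risk(shrink to c) = {risk(theta, c):.2f}, "
          f"risk(no shrinkage) = {p}")
```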

  • @KarolisKoncevičius, thanks for fixing the links! Let me note, however, that your language edits might not be very helpful, except for the last one. The other ones seem to add some redundant text and thus make the post slightly less readable. – Richard Hardy May 23 '19 at 13:10
  • "what's so interesting about the origin?" how do you understand this statement?. if you have a group factor (eg country) and individual factor (eg city), then shrinkage will put average to country level, and then only city level deviations with enough data will have coefficient) - ie your model is pushed to the group level (country) average (by pushing city level coefficients to zero) ... and similarly for more levels in hierarchies (and multiple hierarchies)
  • – seanv507 May 23 '19 at 13:57