I am trying to understand regularization in machine learning, but I do not understand some fundamental concepts in this topic. Could you please explain the following?
- A model with high variance captures noise/randomness in the data, and this supposedly translates into larger coefficient magnitudes. Why does a high-variance model, say $y_h$, have larger coefficient values than a low-variance model, say $y_l$? For a linear regression model, my interpretation is:
$y_h = 2 + 13x_1 + 51x_2 $
$y_l = 2 + 3x_1 + 5x_2$

Is this interpretation correct? Could you explain how noise leads to this increase in the coefficients? (The first sketch after these questions is my attempt to check this numerically.)

- Continuing with the linear regression example: to shrink the coefficients that capture noise, ridge regression adds a shrinkage penalty to the RSS (residual sum of squares):

$\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \mathrm{RSS} + \mathrm{Shrinkage}$
In the RSS, each $\beta_j$ does not appear alone but is multiplied by $x_{ij}$, so why is $\beta_j^2$ added to the RSS rather than $(x_{ij} \beta_j)^2$? (The second sketch below writes this objective out in code.)

- Why is the intercept $\beta_0$ not shrunk?
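To check the first question numerically, here is a minimal simulation sketch of what I think is happening (the near-duplicate feature `x2`, the noise scales, and `alpha=1.0` are my own choices, not from any reference):

```python
# Minimal sketch: two nearly collinear features plus noise can inflate
# the OLS coefficients, while ridge pulls them back toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # x2 is almost a copy of x1
y = 2 + 3 * x1 + 5 * x2 + rng.normal(size=n)  # "true" coefficients: (3, 5)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)            # unregularized fit
ridge = Ridge(alpha=1.0).fit(X, y)            # ridge fit with lambda = 1

print("OLS coefficients:  ", ols.coef_)   # often far from (3, 5), e.g. large opposite-signed values
print("Ridge coefficients:", ridge.coef_) # pulled back toward moderate magnitudes
```

When I run variations of this, the unregularized coefficients tend to swing far from $(3, 5)$ because the fit chases the noise in the nearly redundant features, which seems to match the "high variance implies large coefficients" claim, but I would like this confirmed and explained.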
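And for the second and third questions, this is the objective as I currently read it, written out directly (a minimal sketch; `ridge_objective` and the toy data are my own names and choices). Note that $\beta_0$ enters the residual but is deliberately left out of the penalty term, which is exactly the part I would like explained:

```python
# Minimal sketch of the ridge objective as written above (my own naming).
import numpy as np

def ridge_objective(beta0, beta, X, y, lam):
    residuals = y - beta0 - X @ beta     # y_i - beta_0 - sum_j beta_j * x_ij
    rss = np.sum(residuals ** 2)         # residual sum of squares
    shrinkage = lam * np.sum(beta ** 2)  # penalizes the slopes only, not beta_0
    return rss + shrinkage

# Toy usage with made-up numbers:
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = 2 + X @ np.array([3.0, 5.0]) + rng.normal(size=20)
print(ridge_objective(beta0=2.0, beta=np.array([3.0, 5.0]), X=X, y=y, lam=0.5))
```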