I am trying to understand regularization in machine learning, but I do not understand some fundamental concepts in this topic. Could you please explain the following?
- A model with high variance captures noise/randomness in the data, and this supposedly translates into larger coefficient magnitudes. Why does a high-variance model, say $y_h$, have larger coefficient values than a low-variance model, say $y_l$? For a linear regression model, my interpretation is:
$y_h = 2 + 13x_1 + 51x_2 $
$y_l = 2 + 3x_1 + 5x_2$

Is this interpretation correct? Could you explain how noise leads to this increase in the coefficients? (The first sketch after these questions is my attempt to check this numerically.)

- Continuing with the linear regression example: to shrink the coefficients that capture noise, ridge regression adds a shrinkage penalty to the RSS (residual sum of squares):

$\sum_{i=1}^n \left( y_i - \beta_0 - \sum_{j=1}^p \beta_j x_{ij} \right)^2 + \lambda \sum_{j=1}^p \beta_j^2 = \mathrm{RSS} + \mathrm{Shrinkage}$
In the RSS, each $\beta_j$ does not appear alone but is multiplied by $x_{ij}$, so why is $\beta_j^2$ added to the RSS rather than $(x_{ij} \beta_j)^2$? (The second sketch below writes this objective out in code.)

- Why is the intercept $\beta_0$ not shrunk?
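To check the first question numerically, here is a minimal simulation sketch of what I think is happening (the near-duplicate feature `x2`, the noise scales, and `alpha=1.0` are my own choices, not from any reference):

```python
# Minimal sketch: two nearly collinear features plus noise can inflate
# the OLS coefficients, while ridge pulls them back toward zero.
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
n = 30
x1 = rng.normal(size=n)
x2 = x1 + rng.normal(scale=0.05, size=n)      # x2 is almost a copy of x1
y = 2 + 3 * x1 + 5 * x2 + rng.normal(size=n)  # "true" coefficients: (3, 5)
X = np.column_stack([x1, x2])

ols = LinearRegression().fit(X, y)            # unregularized fit
ridge = Ridge(alpha=1.0).fit(X, y)            # ridge fit with lambda = 1

print("OLS coefficients:  ", ols.coef_)   # often far from (3, 5), e.g. large opposite-signed values
print("Ridge coefficients:", ridge.coef_) # pulled back toward moderate magnitudes
```

When I run variations of this, the unregularized coefficients tend to swing far from $(3, 5)$ because the fit chases the noise in the nearly redundant features, which seems to match the "high variance implies large coefficients" claim, but I would like this confirmed and explained.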
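And for the second and third questions, this is the objective as I currently read it, written out directly (a minimal sketch; `ridge_objective` and the toy data are my own names and choices). Note that $\beta_0$ enters the residual but is deliberately left out of the penalty term, which is exactly the part I would like explained:

```python
# Minimal sketch of the ridge objective as written above (my own naming).
import numpy as np

def ridge_objective(beta0, beta, X, y, lam):
    residuals = y - beta0 - X @ beta     # y_i - beta_0 - sum_j beta_j * x_ij
    rss = np.sum(residuals ** 2)         # residual sum of squares
    shrinkage = lam * np.sum(beta ** 2)  # penalizes the slopes only, not beta_0
    return rss + shrinkage

# Toy usage with made-up numbers:
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 2))
y = 2 + X @ np.array([3.0, 5.0]) + rng.normal(size=20)
print(ridge_objective(beta0=2.0, beta=np.array([3.0, 5.0]), X=X, y=y, lam=0.5))
```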