
Take traditional Ridge regression,

$$ Y_i = \sum_{j=0}^m \beta_{j} X_{i,j} + \epsilon_i $$

we estimate $\beta$ by minimizing the penalized loss,

$$ \hat{\beta}_{ridge} = \arg\min_{\beta}\left(\lambda||\beta||_2^2 + ||\epsilon||_2^2\right)$$

where $\lambda$ is the regularization penalty.

Suppose instead we wrote our model as

$$ Y_i = \sum_{j=0}^m \beta_j X_{i,j} + \sum_{j=1}^n \beta_{m+j} I_{i,j} $$

where $I_{i,j}=1$ if $j=i$ and $0$ otherwise. In other words, the errors become additional parameters, one for each observation. Now the estimate becomes

$$ \hat{\beta}_{ridge} = \arg\min_{\beta}\left(\lambda||\beta_{1:m}||_2^2 + ||\beta_{m+1:m+n}||_2^2\right)$$
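
As a quick numerical sanity check of this reformulation (a sketch with made-up data and illustrative names; the exact-fit problem is solved in closed form as a minimum-weighted-norm problem), the recovered $\beta_{1:m}$ matches standard ridge:

```python
# Sketch: the "residuals as parameters" formulation, with penalty lambda on the
# first m coefficients and penalty 1 on the n residual coefficients (subject to
# fitting the data exactly), reproduces standard ridge. Sizes and lambda are arbitrary.
import numpy as np

rng = np.random.default_rng(0)
n, m, lam = 50, 5, 2.0
X = rng.standard_normal((n, m))
y = X @ rng.standard_normal(m) + rng.standard_normal(n)

# Standard ridge: beta = (X'X + lam I)^{-1} X'y
beta_ridge = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Augmented design A = [X | I_n]; solve A z = y exactly while minimizing
# lam * ||z_{1:m}||^2 + ||z_{m+1:m+n}||^2 (weighted minimum-norm solution).
A = np.hstack([X, np.eye(n)])
D_inv = np.diag(np.r_[np.full(m, 1.0 / lam), np.ones(n)])  # inverse penalty weights
z = D_inv @ A.T @ np.linalg.solve(A @ D_inv @ A.T, y)

print(np.allclose(z[:m], beta_ridge))  # True: same coefficients
```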

In this case, assuming standardized coefficients, shouldn't it be clear that these "error"/residual parameters should be treated like any other parameter, i.e. that $\lambda = 1$? Then we would just have

$$ \hat{\beta}_{ridge} = \arg\min_{\beta}||\beta_{1:m+n}||_2^2$$

I see this answer here, but if the data are standardized, I don't see why these error parameters should get a different weight. (Maybe they need to be standardized too.)

(Equivalent question for Lasso with Least Absolute Deviations.)

  • Shouldn't the coefficients be $\beta_j$ instead of $\beta_{i,j}$ (unless these are random effects)? –  Jun 23 '22 at 00:13
  • Right, fixed, thanks. – dashnick Jun 23 '22 at 00:28
  • Typically our goal is predictive accuracy or accuracy via a cross-validation method of one sort or another; there is no reason to believe that $\lambda=1$ is optimal for out-of-sample accuracy measures. Furthermore, even with respect to your in-sample argument, the "error" parameters are not real errors, they are just an artifact of a particular formulation of the problem. Generally, we don't want to treat formulation artifacts identically to how we treat real, in this case, errors. – jbowman Jun 23 '22 at 00:28
  • @jbowman maybe "errors" is misleading, "residuals" if you like. But anyway seems arbitrary that just because these parameters only apply to a single observation they are treated differently. Imagine there was a categorical variable that only applied to 2 records.. this parameter would show up with the lambda. But categories with 1 record are fundamentally different? – dashnick Jun 23 '22 at 00:42
  • It's not an observation at all, it's an artifact of how we write out the problem. There is no collection of $m$ data points with dependent variable values $=0$ and each of the $m$ independent variables with value $1$ exactly once. Given that they aren't real, observed, data points, why should I treat them as if they were? – jbowman Jun 23 '22 at 01:51
  • "shouldn't it be clear that these "error"/ residual parameters should be treated as any other" This is not clear at all, why should you treat $\lambda = 1$? Just because you were able to relabel the $\epsilon$ as $\beta$. – Sextus Empiricus Jun 23 '22 at 05:39

1 Answer


$\lambda$ doesn't equal $1$ because it is defined as $\sigma^2/\tau^2$, where $\sigma$ is the standard deviation of the observation noise and $\tau$ is the standard deviation of the prior on the coefficients $\beta_j$. Remember, the point of ridge regression is to penalize coefficients that grow too large in magnitude, i.e. a prior $p(\beta)=\prod_j\mathcal N(\beta_j\mid 0,\tau^2)$ is placed on the coefficients. The ridge estimate is then the MAP estimate: up to constants, the negative log posterior is

$$ \frac{1}{2\sigma^2}||Y - X\beta||_2^2 + \frac{1}{2\tau^2}||\beta||_2^2, $$

and multiplying through by $2\sigma^2$ gives the ridge objective $||Y - X\beta||_2^2 + \frac{\sigma^2}{\tau^2}||\beta||_2^2$, so $\lambda = \sigma^2/\tau^2$. In general, when we place a Gaussian prior on the parameters of a model to encourage them to be small, this is called $\ell_2$ regularization or weight decay.

$\lambda$ is greater than or equal to $0$, with larger values corresponding to a larger prior precision $1/\tau^2$ relative to the noise variance. Since the prior on $\beta_j$ has mean $0$, a larger $\lambda$ pulls the coefficients closer to $0$, giving them smaller magnitudes.
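
As a quick sanity check of this correspondence (a sketch with made-up data; the particular $\sigma$ and $\tau$ values are arbitrary), the ridge solution with penalty $\lambda = \sigma^2/\tau^2$ coincides with the Gaussian posterior mean, which is also the MAP estimate since the posterior is Gaussian:

```python
# Sketch: ridge with lambda = sigma^2 / tau^2 equals the Gaussian posterior
# mean / MAP estimate under beta_j ~ N(0, tau^2) and noise ~ N(0, sigma^2).
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(1)
n, m = 100, 4
sigma, tau = 0.5, 2.0
X = rng.standard_normal((n, m))
y = X @ rng.standard_normal(m) + sigma * rng.standard_normal(n)

lam = sigma**2 / tau**2

# Posterior mean / MAP: (X'X + (sigma^2/tau^2) I)^{-1} X'y
beta_map = np.linalg.solve(X.T @ X + lam * np.eye(m), X.T @ y)

# Ridge with the corresponding penalty (no intercept, to match the model above)
beta_ridge = Ridge(alpha=lam, fit_intercept=False).fit(X, y).coef_

print(np.allclose(beta_map, beta_ridge))  # True
```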