2

I've been working with penalty-based models, most specifically LASSO, but I've also been reading the surrounding literature on ridge and elastic net.

Conceptually, I understand the nature of a penalty term and its implications. However, I was hoping someone could explain why these penalties are also called 'norms', and how they relate to normed vector spaces etc.?

A relatively non-technical explanation/example would be nice, if possible.

EB3112
  • 244

2 Answers

1

I think that this article answers your question. This kind of model aims at minimising squared deviations (the objective, which gives the optimisation an incentive to produce a model with better prediction accuracy), with a penalty applied to the norm of the estimated parameters:

$$\min_{\beta\in\mathbb{R}^k} \frac{1}{n}\vert\vert Y-X\beta\vert\vert^2_2+\lambda \vert\vert\beta\vert\vert ^p_p$$

with the norm being defined as a function of $p$ as:

$$\vert\vert\beta\vert\vert_p=\left(\sum_{j=1}^{k}\vert\beta_j\vert^p\right)^\frac{1}{p}$$
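To make the definition concrete, here is a minimal Python sketch (NumPy only; the vector values are made up for illustration) that computes $\vert\vert\beta\vert\vert_p$ directly from the formula and checks it against NumPy's built-in norm:

```python
import numpy as np

def p_norm(beta, p):
    """||beta||_p = (sum_j |beta_j|^p)^(1/p)."""
    return np.sum(np.abs(beta) ** p) ** (1.0 / p)

beta = np.array([3.0, -4.0])   # made-up parameter vector

print(p_norm(beta, 1))         # L1 norm: |3| + |-4| = 7
print(p_norm(beta, 2))         # L2 norm: sqrt(9 + 16) = 5

# agrees with NumPy's built-in implementation
assert np.isclose(p_norm(beta, 2), np.linalg.norm(beta, 2))
```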

Applying this kind of penalty to the norm of the parameters highlights the most important regressors and, for some values of $p$, excludes the least important features from the regressors altogether.

Note that LASSO corresponds to $p=1$ and ridge regression corresponds to $p=2$, i.e. they belong to the same category of model but use norms with different definitions, which explains why their results are not exactly the same even though they share the same general purpose.
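One way to see the different behaviour of the two norms (this goes slightly beyond the formulas above, so treat it as an illustrative sketch with made-up numbers) is through their proximal operators: penalising with $p=1$ soft-thresholds each coefficient, producing exact zeros, while penalising with $p=2$ only rescales the coefficients towards zero:

```python
import numpy as np

def prox_l1(beta, lam):
    # minimiser of 0.5*(b - beta)^2 + lam*|b| : soft-thresholding
    return np.sign(beta) * np.maximum(np.abs(beta) - lam, 0.0)

def prox_l2(beta, lam):
    # minimiser of 0.5*(b - beta)^2 + lam*b^2 : uniform shrinkage
    return beta / (1.0 + 2.0 * lam)

beta = np.array([0.3, -1.5, 0.05])   # made-up coefficients

print(prox_l1(beta, 0.1))  # the small entry 0.05 becomes exactly 0
print(prox_l2(beta, 0.1))  # every entry shrinks, none becomes 0
```

This is the mechanism behind LASSO's feature exclusion: the $p=1$ penalty can push small coefficients exactly to zero, whereas the $p=2$ penalty never does.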

Explanation of a norm with $p=2$:

[figure: the vector $b=(2, 1)$ drawn in the plane, its $p=2$ norm being its length]

Start with a vector $b=(2, 1)$ and draw it in the plane. By definition, it has 2 elements, i.e. 2 coordinates in the plane. Using the norm with $p=2$ requires squaring each component, which distorts the vector. When you sum these transformed coordinates, you get 5. If you then apply the inverse of the transformation (the square root), you get the norm, in this case $\sqrt{5}$. Here you recognise the Pythagorean theorem: the norm with $p=2$ is simply the length of a vector in traditional geometry. With $p=1$, you would simply get the sum of the absolute values of the components.
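The numbers in this example are easy to check by hand, or with a quick self-contained sketch in Python (NumPy only):

```python
import numpy as np

b = np.array([2.0, 1.0])

l2 = np.sqrt(np.sum(b ** 2))   # square, sum, invert the square: sqrt(4 + 1)
l1 = np.sum(np.abs(b))         # sum of the component lengths: 2 + 1

print(l2)  # 2.2360... i.e. sqrt(5), the geometric length of b
print(l1)  # 3.0
```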

FP0
  • 456
  • Hi @FP0. Thanks for your answer :) It makes more sense now. You've given an expression for the norm. But could you expand on the notion of a norm here more generally? Perhaps if possible, a visual explanation of norms applied to the estimated parameters. – EB3112 Jul 27 '22 at 09:51
  • I tried to give a visual example. A norm is simply some kind of measure of length of a vector. – FP0 Jul 27 '22 at 10:34
  • Thank you @FP0. Very much appreciated :) – EB3112 Jul 27 '22 at 11:37
  • You are very welcome ! When you are satisfied with the answers you received, please remember to accept one of them as an answer. – FP0 Jul 27 '22 at 12:30
1

First of all, not all forms of regularization use norms. For example, dropout or early stopping do not.

As for norms, the term comes from mathematics. In $\ell_1$ and $\ell_2$ regularization, the penalties are exactly those norms applied to the vector of parameters. If a regular linear regression model is

$$ \underset{\beta }{\operatorname{arg\,min}} \;\| y - X\beta \|^2_2 $$

then ridge regression is

$$ \underset{\beta }{\operatorname{arg\,min}} \;\| y - X\beta \|^2_2 + \color{red}{\lambda \| \beta\|^2_2} $$

so it amounts to applying the $\ell_2$ norm to the vector of parameters. A norm, basically, is:

The norm of a mathematical object is a quantity that in some (possibly abstract) sense describes the length, size, or extent of the object. [...]

So by adding the "size" of the parameters to the total cost of the function that we minimize, we force the algorithm to find a solution to the optimization problem where the cost is smallest, hence where the "size" of the parameters is reduced as well, where by "size" we mean whatever we used as our norm.
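As a sketch of that effect: the ridge problem above has the well-known closed-form solution $\beta = (X^\top X + \lambda I)^{-1} X^\top y$, and increasing $\lambda$ shrinks the "size" (here the $\ell_2$ norm) of the fitted parameters. The data below are made up for illustration:

```python
import numpy as np

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^(-1) X'y."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(k), X.T @ y)

# simulated data with known coefficients (for illustration only)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=50)

for lam in [0.0, 1.0, 10.0]:
    beta = ridge(X, y, lam)
    # the L2 norm of the fitted parameters decreases as lam grows
    print(lam, np.linalg.norm(beta))
```

With $\lambda = 0$ this reduces to ordinary least squares; larger $\lambda$ trades some fit for a smaller parameter norm.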

If you ask about the name "norm" itself, the Mathematics Q&A site has two answers on this here and here, and the earliest uses of the term can be traced here. TL;DR it's about the standard units.

Tim
  • 138,066