If p > n, the lasso selects at most n variables

Question

One of the motivations for the elastic net was the following limitation of LASSO:

In the $p > n$ case, the lasso selects at most n variables before it saturates, because of the nature of the convex optimization problem. This seems to be a limiting feature for a variable selection method. Moreover, the lasso is not well defined unless the bound on the L1-norm of the coefficients is smaller than a certain value.

(http://onlinelibrary.wiley.com/doi/10.1111/j.1467-9868.2005.00503.x/full)

I understand that LASSO is a quadratic programming problem but also can be solved via LARS or element-wise gradient descent. But I do not understand where in these algorithms I encounter a problem if $p > n$ where $p$ is the number of predictors and $n$ is the sample size. And why is this problem solved using elastic net where I augment the problem to $p+n$ variables which clearly exceeds $p$.

If lasso restricts use to keeping p<=n why is that a drawback rather than a virtue. overfitting is a serious problem that comes about when p=n. The model with p=n is a saturated model and often that model overfits because it will fit the observed data perfectly but not necessarily predcit future cases well. — Michael R. Chernick, Sep 30 '12 at 12:07
That the lasso selects only up to $n$ variables can be seen as a consequence of the fact that it can be solved using (a slight modification of) the LARS algorithm, which only admits up to $n$ variables into the active set at any one time. That this does not hold in the elastic-net case essentially follows from the incorporation of the $\ell_2$ penalty and so behaves more like ridge regression, the latter of which normally results in all coefficients being nonzero. — cardinal, Sep 30 '12 at 13:57
Thank you for the answers, and how would I see for gradient descent that at most n variables can selected:
Presentation at http://www.cs.cmu.edu/afs/cs/project/link-3/lafferty/www/ml-stat2/talks/YondaiKimGLasso-SLIDE-YD.pdf Paper (section 4) at http://datamining.dongguk.ac.kr/papers/GLASSO_JRSSB_V1.final.pdf — user1137731, Sep 30 '12 at 21:53
@user: I think you may be conflating the mathematical problem with its numerical solution. The LARS algorithm shows that the lasso solution will select at most $n$ variables. This is independent of the actual numerical means for arriving at the solution, i.e., the LARS algorithm gives the insight about the problem, but of course any other method that equivalently solves the problem must have the same property! :-) — cardinal, Oct 01 '12 at 00:29
Consider a feature duplicated $p$ times. There will exist a lasso estimator with exactly $p$ nonzeroes (even if $p>n$) Therefore your statement isn't true as written. — user795305, Sep 14 '17 at 02:55

score 16 · Answer 1 · edited Oct 01 '12 at 11:58

16

As said, this is not a property of an algorithm but of the optimization problem. The KKT conditions basically give that for coefficient $\beta_j$ to be non-zero it has to correspond to a fixed correlation with the residual $|X_j^t(y-X\beta)| = \lambda$ ($\lambda$ is the regularization parameter).

After resolving the various complications with absolute value etc, you are left with a linear equation for each non-zero coefficient. Since the rank of the matrix $X$ is at most $n$ when $p>n$, this is the number of equations that can be solved, and therefore there are at most n non-zeros (unless there are redundancies).

By the way, this is true for any loss function, not only the standard lasso with $L_2$ loss. So it is in fact a property of the lasso penalty. There are many papers that show this KKT view and the resulting conclusions, I can point to our paper: Rosset and Zhu, Piecewise Linear Regularized Solutions Paths, Annals of Stats 2007 and refs therein.

edited Oct 01 '12 at 11:58

user1137731

505

answered Oct 01 '12 at 08:26

Saharon Rosset

161

1

What does KKT stand for? Also, is it possible you mean L1 loss when talking about the standard lasso? – miura Oct 01 '12 at 08:58
Hi Saharon and welcome to the site. You can use LaTeX to make formulas neater (I did so in your answer) and you don't need to sign your posts, as a signature is added automatically. – Peter Flom Oct 01 '12 at 10:58
1

@miura: KKT stands for Karush-Kuhn-Tucker. The KKT conditions are certain equations that solutions to (sufficiently regular) optimization problems must fulfill (wikipedia article). – mogron Oct 01 '12 at 11:05
I just see that Ryan Tibshirani has a very relevant working paper 'The Lasso Problem and Uniqueness.': http://www.stat.cmu.edu/~ryantibs/papers/lassounique.pdf – user1137731 Oct 08 '12 at 02:40

score 8 · Answer 2 · answered Jul 28 '16 at 10:05

Another explanation is the following: if $n < p$, the rank of the data matrix $X$ is at most $n$, so the dimension of its (right) null space is at least $p - n$. Write any vector in this null space as $z$. Then at any feasible point $\beta$, one can always move in this $p - n$-dimensional null space towards the coordinate axes of the $p$-dimensional ambient space, to arrive at a $\beta+z$, where (at most) $n$ $\beta_j$s are nonzero, and the LASSO objective function

has decreased.

(+1) There's a gap here: see my comment on OPs post. – user795305 Sep 14 '17 at 02:59 — user795305, Sep 14 '17 at 02:59

If p > n, the lasso selects at most n variables

2 Answers2

Linked