
In lasso, we have a regularization term in the loss function:

$$\|y-\hat{y}\|_{2}^{2} + \lambda \|\beta\|_{1}$$

As the loss function is minimized, some of the $\beta$'s become exactly zero; that's what people refer to as 'sparsity'.

My question is: how do I add a hard constraint on the maximum number of non-zero $\beta$'s, say, 10?

I suppose this is a mixed-integer programming problem: we introduce binary variables $s_j \in \{0, 1\}$, one per coefficient, together with an extra constraint $\sum_j s_j = 10$. Then we set $\beta_j = \beta_{{\rm raw},j} \cdot s_j$.

Then I got stuck: how do I constrain $\beta_{\rm raw}$?

Any insight?

eight3

1 Answer


Let $\beta_j$ be the $j$-th regression coefficient and $s_j$ the binary variable indicating whether you are including that term in the regression. To simplify the objective function, add a nonnegative variable $u_j$ representing the absolute value of $\beta_j$. The lasso term in the objective becomes $\sum_j u_j$. Now add the constraints $\beta_j \le u_j$, $-\beta_j \le u_j$, and $u_j \le M_j s_j$, where $M_j$ is an a priori upper bound on $|\beta_j|$. Together with your cardinality constraint $\sum_j s_j \le 10$, this completes the mixed-integer formulation.
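
For concreteness, here is a minimal sketch of this big-M formulation in cvxpy. The data, $\lambda$, the cardinality limit $k$, and the bounds $M_j$ are all placeholders, and solving requires a mixed-integer-capable solver (e.g., GUROBI, MOSEK, or SCIP) to be installed:

```python
import cvxpy as cp
import numpy as np

def fit_sparse_lasso(X, y, lam, k, M):
    """Lasso with a hard limit of k non-zero coefficients via big-M binaries."""
    p = X.shape[1]
    beta = cp.Variable(p)                 # regression coefficients beta_j
    u = cp.Variable(p, nonneg=True)       # u_j models |beta_j|
    s = cp.Variable(p, boolean=True)      # s_j = 1 if predictor j is selected

    objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.sum(u))
    constraints = [
        beta <= u,                        #  beta_j <= u_j
        -beta <= u,                       # -beta_j <= u_j
        u <= cp.multiply(M, s),           #  u_j <= M_j * s_j  (big-M link)
        cp.sum(s) <= k,                   #  at most k predictors selected
    ]
    prob = cp.Problem(objective, constraints)
    prob.solve()                          # needs a MIQP/MISOCP-capable solver
    return beta.value

# Toy data just to exercise the model.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, :5] @ np.array([1.0, -2.0, 3.0, 0.5, -1.5]) + 0.1 * rng.standard_normal(100)
beta_hat = fit_sparse_lasso(X, y, lam=0.1, k=10, M=10.0 * np.ones(20))
print(np.count_nonzero(np.abs(beta_hat) > 1e-6))
```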

The tricky part is picking reasonable values for $M_j$. Too small and you risk producing a suboptimal fit; too large and the solver may do funny things. If the problem solves quickly, you might do it by trial and error: pick values; solve; increase the values; solve again; iterate until the regression coefficients do not change from one run to the next. If the solution time is not short, maybe start with a regression with no limit on the number of terms (and no $s$ variables), then guess $M_j$ values from the fitted $\beta_j$.
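
That trial-and-error procedure can be automated: solve, and whenever a fitted coefficient sits at its bound $M_j$, enlarge that bound and solve again. A rough sketch, assuming `X`, `y`, and the hypothetical `fit_sparse_lasso` helper from the sketch above:

```python
import numpy as np

# Iteratively enlarge any big-M bound that turns out to be binding.
M = 10.0 * np.ones(X.shape[1])
for _ in range(10):                       # safety cap on the number of re-solves
    beta_hat = fit_sparse_lasso(X, y, lam=0.1, k=10, M=M)
    binding = np.abs(beta_hat) >= M - 1e-6
    if not binding.any():                 # no coefficient is pinned at its bound
        break
    M[binding] *= 2.0                     # double the bounds that were binding
```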

prubin
  • A similar approach is used in this paper. Picking reasonable values for $M_j$ is also discussed. – Kevin Dalmeijer Sep 26 '19 at 12:25
  • If the objective function contains a ridge regularization term, you could also consider imposing sparsity via second-order cone constraints, i.e., replacing $\Vert \beta\Vert_2^2$ in the objective function with $\sum_i \theta_i$ where $\beta_i^2 \leq \theta_i s_i$. This is MISOCP representable via $$\Vert (2 \beta_i, \theta_i-s_i)^\top\Vert_2 \leq \theta_i+s_i.$$ This approach has the benefit of not requiring any big-M constraints. – Ryan Cory-Wright Sep 26 '19 at 15:18
  • Or if you wanted to get really sophisticated, you could also decompose the problem and use a cutting-plane method (this is faster than the MISOCP approach but requires a bit more work). There is a Julia implementation for sparse regression problems available here. – Ryan Cory-Wright Sep 26 '19 at 15:21
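
As a short check of the cone constraint in the second comment (with $\theta_i, s_i \ge 0$), squaring both sides gives

$$\Vert (2 \beta_i,\ \theta_i - s_i)^\top\Vert_2 \leq \theta_i + s_i
\;\Longleftrightarrow\;
4\beta_i^2 + (\theta_i - s_i)^2 \leq (\theta_i + s_i)^2
\;\Longleftrightarrow\;
\beta_i^2 \leq \theta_i s_i.$$

In particular, $s_i = 0$ forces $\beta_i = 0$, which is how this formulation enforces sparsity without any big-M bounds.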