
In lasso, we have a regularization term in the loss function:

$$\|y-\hat{y}\|_{2}^{2} + \lambda \|\beta\|_{1}$$

As the loss function is minimized, some of the $\beta$'s become exactly zero; that's what people refer to as 'sparsity'.

My question is: how do I add a hard constraint on the maximum number of non-zero $\beta$'s, say, 10?

I suppose this is a mixed-integer programming problem: we introduce binary variables $s_j \in \{0, 1\}$, one per coefficient, together with an extra constraint $\sum_j s_j = 10$. Then we set $\beta_j = \beta_{{\rm raw},j} \cdot s_j$.

Then I got stuck: how do I constrain $\beta_{\rm raw}$?

Any insight?

eight3

1 Answer


Let $\beta_j$ be the $j$-th regression coefficient and $s_j$ the binary variable indicating whether you are including that term in the regression. To simplify the objective function, add a nonnegative variable $u_j$ representing the absolute value of $\beta_j$. The lasso term in the objective becomes $\sum_j u_j$. Now add the constraints $\beta_j \le u_j$, $-\beta_j \le u_j$, and $u_j \le M_j s_j$, where $M_j$ is an a priori upper bound on $|\beta_j|$. Together with your cardinality constraint $\sum_j s_j \le 10$, this completes the mixed-integer formulation.
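
For concreteness, here is a minimal sketch of this big-M formulation in cvxpy. The data, $\lambda$, the cardinality limit $k$, and the bounds $M_j$ are all placeholders, and solving requires a mixed-integer-capable solver (e.g., GUROBI, MOSEK, or SCIP) to be installed:

```python
import cvxpy as cp
import numpy as np

def fit_sparse_lasso(X, y, lam, k, M):
    """Lasso with a hard limit of k non-zero coefficients via big-M binaries."""
    p = X.shape[1]
    beta = cp.Variable(p)                 # regression coefficients beta_j
    u = cp.Variable(p, nonneg=True)       # u_j models |beta_j|
    s = cp.Variable(p, boolean=True)      # s_j = 1 if predictor j is selected

    objective = cp.Minimize(cp.sum_squares(y - X @ beta) + lam * cp.sum(u))
    constraints = [
        beta <= u,                        #  beta_j <= u_j
        -beta <= u,                       # -beta_j <= u_j
        u <= cp.multiply(M, s),           #  u_j <= M_j * s_j  (big-M link)
        cp.sum(s) <= k,                   #  at most k predictors selected
    ]
    prob = cp.Problem(objective, constraints)
    prob.solve()                          # needs a MIQP/MISOCP-capable solver
    return beta.value

# Toy data just to exercise the model.
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 20))
y = X[:, :5] @ np.array([1.0, -2.0, 3.0, 0.5, -1.5]) + 0.1 * rng.standard_normal(100)
beta_hat = fit_sparse_lasso(X, y, lam=0.1, k=10, M=10.0 * np.ones(20))
print(np.count_nonzero(np.abs(beta_hat) > 1e-6))
```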

The tricky part is picking reasonable values for $M_j$. Too small and you risk producing a suboptimal fit; too large and the solver may do funny things. If the problem solves quickly, you might do it by trial and error: pick values; solve; increase the values; solve again; iterate until the regression coefficients do not change from one run to the next. If the solution time is not short, maybe start with a regression with no limit on the number of terms (and no $s$ variables), then guess $M_j$ values from the fitted $\beta_j$.
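
That trial-and-error procedure can be automated: solve, and whenever a fitted coefficient sits at its bound $M_j$, enlarge that bound and solve again. A rough sketch, assuming `X`, `y`, and the hypothetical `fit_sparse_lasso` helper from the sketch above:

```python
import numpy as np

# Iteratively enlarge any big-M bound that turns out to be binding.
M = 10.0 * np.ones(X.shape[1])
for _ in range(10):                       # safety cap on the number of re-solves
    beta_hat = fit_sparse_lasso(X, y, lam=0.1, k=10, M=M)
    binding = np.abs(beta_hat) >= M - 1e-6
    if not binding.any():                 # no coefficient is pinned at its bound
        break
    M[binding] *= 2.0                     # double the bounds that were binding
```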

prubin
  • A similar approach is used in this paper. Picking reasonable values for $M_j$ is also discussed. – Kevin Dalmeijer Sep 26 '19 at 12:25
  • If the objective function contains a ridge regularization term, you could also consider imposing sparsity via second-order cone constraints, i.e., replacing $\Vert \beta\Vert_2^2$ in the objective function with $\sum_i \theta_i$ where $\beta_i^2 \leq \theta_i s_i$. This is MISOCP representable via $$\Vert (2 \beta_i, \theta_i-s_i)^\top\Vert_2 \leq \theta_i+s_i.$$ This approach has the benefit of not requiring any big-M constraints. – Ryan Cory-Wright Sep 26 '19 at 15:18
  • Or if you wanted to get really sophisticated, you could also decompose the problem and use a cutting-plane method (this is faster than the MISOCP approach but requires a bit more work). There is a Julia implementation for sparse regression problems available here. – Ryan Cory-Wright Sep 26 '19 at 15:21
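
As a short check of the cone constraint in the second comment (with $\theta_i, s_i \ge 0$), squaring both sides gives

$$\Vert (2 \beta_i,\ \theta_i - s_i)^\top\Vert_2 \leq \theta_i + s_i
\;\Longleftrightarrow\;
4\beta_i^2 + (\theta_i - s_i)^2 \leq (\theta_i + s_i)^2
\;\Longleftrightarrow\;
\beta_i^2 \leq \theta_i s_i.$$

In particular, $s_i = 0$ forces $\beta_i = 0$, which is how this formulation enforces sparsity without any big-M bounds.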