
The regularized minimization problem $$\min_x\; f(x) + \lambda g(x)$$ where $\lambda>0$, and the constrained minimization problem $$\min_x\; f(x) \quad\text{s.t.}\quad g(x) \leq \epsilon$$ where $\epsilon>0$, can be converted into one another so that they have the same solution.

Examples are an $\ell^1$-regularized linear least-squares problem and a BPDN (basis pursuit denoising) problem.
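For concreteness, the specific pair I have in mind (as I understand the standard definitions, with data $A$ and $b$) is $$\min_x\; \tfrac{1}{2}\|Ax-b\|_2^2 + \lambda\|x\|_1 \qquad\text{and}\qquad \min_x\; \|x\|_1 \quad\text{s.t.}\quad \|Ax-b\|_2 \leq \epsilon.$$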

Generally speaking, which of the two problems is easier to solve? Which one do people prefer for modelling in their applications? Thanks and regards!

Tim
  • What's your criteria for "easy"? Implementation or computational expense? – Paul Apr 07 '13 at 16:42
  • @Paul: yes, I think it is what you said. Also are there considerations like convenience for analysis (not necessarily for solving). – Tim Apr 07 '13 at 16:49
  • I imagine you mean to use $f+\lambda ([g-\varepsilon]^-)^2$ where $[\cdot]^-$ equals the value of the argument if it is negative, and is zero otherwise! – Wolfgang Bangerth Apr 07 '13 at 18:25
  • @WolfgangBangerth: Thanks! I wonder why you change it that way? I think my post is same as this reply http://math.stackexchange.com/a/336618/3348. – Tim Apr 07 '13 at 19:52
  • Because the two formulations you reference do not coincide with the same problem. Or is your goal to solve the problem $\min f(x)$ so that $h(x)=0$ where $g(x)=h(x)^2$? Even in that case, the correct constrained formulation would not have $g(x)\le \varepsilon$ but $h(x)=0$ as constraint. – Wolfgang Bangerth Apr 07 '13 at 22:23
  • @WolfgangBangerth: I am not sure if I understand your comment. This post http://math.stackexchange.com/a/336618/3348 shows that the two problems in my post are equivalent. Or is it wrong? – Tim Apr 08 '13 at 10:25
  • For one particular choice of $g(x),\lambda,\varepsilon$ the two problems are the same, yes. But the way you formulated it makes it sound as if this were the case for any $g$ and a particular combination of $\lambda,\varepsilon$. But that's not true. To give just one example, take $f(x)=x^2, g(x)=x$; then the first problem has its solution at $x^\ast=-\lambda/2$ whereas the other one has its solution at $x^\ast=0$. – Wolfgang Bangerth Apr 09 '13 at 02:48

2 Answers


Without knowing more about $f$ and $g$, there really isn't going to be a clear winner here. Sometimes, the regularized form is going to be easier; sometimes, the constrained form is going to be easier.

Let me dispatch some easier cases. First of all, if both $f$ and $g$ are smooth and have readily computable derivatives, then I'll definitely give the edge to the regularized form, because that will give you an unconstrained smooth problem. If you're using an interior-point method to solve either problem, then there should be no consistent advantage for either form.

Here's an interesting case. Consider $f$ smooth, with a readily computable gradient. Consider two different possibilities for $g$: $$g_1(x)=\|x\|_\infty \quad\text{and}\quad g_2(x)=\|x\|_1$$ Which will be easier: the regularized form or the constrained form? I argue the answer is not the same for both.

A first-order method is going to be a logical choice here, which will involve alternating between standard gradient steps on $f$ and simple nonsmooth minimization steps on $g$. For the regularized form, the $g$ steps will be prox minimizations $$ x = \mathop{\text{arg}\,\text{min}}_z\; g(z) + \tfrac{1}{2} \|x^{(+)}-z\|_2^2. $$ For the constrained form, the $g$ steps will be projections $$ x = \mathop{\text{arg}\,\text{min}}_{z:\, g(z)\leq \epsilon}\; \tfrac{1}{2} \|x^{(+)}-z\|_2^2. $$ In each case, $x^{(+)}$ is the point obtained by taking the gradient step.
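As a rough sketch of what I mean (not the TFOCS code; `grad_f`, `prox_g`, `project_g`, and the fixed step size `t` are placeholder names I am assuming), the two iterations differ only in the nonsmooth step:

```python
import numpy as np

def solve_regularized(grad_f, prox_g, x0, lam, t=1e-2, iters=500):
    """Proximal-gradient sketch for  min f(x) + lam*g(x)."""
    x = x0.copy()
    for _ in range(iters):
        x_plus = x - t * grad_f(x)        # gradient step on f
        x = prox_g(x_plus, t * lam)       # prox step on g, scaled by t*lam
    return x

def solve_constrained(grad_f, project_g, x0, eps, t=1e-2, iters=500):
    """Projected-gradient sketch for  min f(x)  s.t.  g(x) <= eps."""
    x = x0.copy()
    for _ in range(iters):
        x_plus = x - t * grad_f(x)        # gradient step on f
        x = project_g(x_plus, eps)        # projection onto {z : g(z) <= eps}
    return x
```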

We implemented both projections and both proximity functions in our software TFOCS. It turns out that for $g_1(x)=\|x\|_\infty$, the projection is the easier one to implement; and for $g_2(x)=\|x\|_1$, the prox minimization is the easier one. In these two easy cases, the problem decouples---each variable can be determined separately. In the opposite cases---the prox minimization for $g_1$, the projection for $g_2$---there is coupling across the variables.
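Here is a minimal sketch of the two decoupled, elementwise operations (using the scaling convention `prox_g(v, t)` = prox of $t\,g$ from the sketch above; again, not the TFOCS code itself):

```python
import numpy as np

def prox_l1(v, t):
    """prox of t*||.||_1: elementwise soft-thresholding, O(n)."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def project_linf(v, eps):
    """Projection onto {x : ||x||_inf <= eps}: elementwise clipping, O(n)."""
    return np.clip(v, -eps, eps)
```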

I believe our implementations are $O(n)$ for the easier cases, and $O(n\log n)$ for the harder ones. To be fair, neither cost will be significant if you're solving a problem on a single machine where the gradient steps dominate. But what if you're solving a very large-scale problem on a cluster, using distributed computation to compute the gradient? Then you're going to prefer the decoupling that the $\|x\|_\infty$ projection and the $\|x\|_1$ prox provide.
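For contrast, here is a sketch of one of the coupled cases, Euclidean projection onto the $\ell_1$ ball, using the standard sort-based construction (my own illustration of why a sort shows up; the TFOCS internals may differ):

```python
import numpy as np

def project_l1(v, eps):
    """Euclidean projection onto {x : ||x||_1 <= eps} via sorting, O(n log n)."""
    if np.abs(v).sum() <= eps:
        return v.copy()
    u = np.sort(np.abs(v))[::-1]                         # magnitudes, descending
    css = np.cumsum(u)
    k = np.nonzero(u * np.arange(1, len(v) + 1) > css - eps)[0][-1]
    theta = (css[k] - eps) / (k + 1.0)                   # soft-threshold level
    return np.sign(v) * np.maximum(np.abs(v) - theta, 0.0)
```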

One case I haven't considered is the one where $f$ is nonsmooth and $g$ is smooth. If the regularized form is simple to compute---because $f(x)$ admits a simple prox or projection---then yes, I'd give the edge to that form. But if you can compute the dual of the constrained problem, it is likely to take on a regularized form as well! So there is a real possibility that you could construct an algorithm to solve the constrained form with similar efficiency.
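To make that connection explicit (this is just the standard Lagrangian argument, not anything specific to these solvers): forming the Lagrangian of the constrained problem gives $$\max_{\lambda\geq 0}\;\Big(\min_x\; f(x) + \lambda\,(g(x)-\epsilon)\Big),$$ and the inner minimization is exactly the regularized form, up to the constant $-\lambda\epsilon$.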

So in summary, it really depends on the specific characteristics of $f$ and $g$, the kinds of algorithms you're willing to deploy, and how much work you're going to do to extract efficiency. (EDIT: I'm assuming convex problems throughout here. But if $f$ is non-convex but smooth and $g$ is non-smooth and convex, then most of what I've said above about first-order methods will still apply! Of course you can't guarantee global convergence anymore.)

Michael Grant

Having dealt with quite a number of nonlinear optimization problems over the years, my take on this is that the penalized formulation is obviously much simpler to implement. If you're interested in a quick solution for an easy problem, that's the way to go. Historically, this is also the approach that was taken in the early years of computational optimization.

That said, the constrained formulation is much more powerful; for example, it can find minima that actually lie on the boundary of the feasible set. The reason this formulation is not always used is that it has a much higher barrier to entry: it requires significant implementation effort, as well as significant mathematical machinery to develop efficient numerical methods for the linearized problems you have to solve in every step of, say, a Newton method. In other words, the constrained formulation requires much more work to make it work efficiently; however, it then clearly outperforms penalized methods on all but the simplest problems.

Wolfgang Bangerth