I am currently following CS229, and I'm trying to be diligent about proving most of the things that are not immediately obvious. Right now I'm looking at the derivation of the SVM as the dual of the optimal margin classifier problem.
Let $x^{(i)}, y^{(i)}$, $i = 1, \ldots, m$, be the features and labels. The optimal margin classifier problem is then defined as:
$$ \min_{w,b} \;\; \frac{1}{2} \|w\|^2 \\ \text{s.t.} \;\;\; y^{(i)}(w^Tx^{(i)} + b) \ge 1\,, \quad i = 1, \ldots, m $$
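Just to have something concrete to play with, here is a tiny numerical sketch of this primal on made-up toy data (my own addition, assuming `cvxpy`; it is not part of the course material):

```python
# Sketch of the primal optimal margin classifier on a tiny separable dataset.
# Assumes cvxpy is installed; the data is made up for illustration only.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])  # features x^(i)
y = np.array([1.0, 1.0, -1.0, -1.0])                                # labels y^(i) in {-1, +1}
m, n = X.shape

w = cp.Variable(n)
b = cp.Variable()

# min (1/2) ||w||^2   s.t.   y^(i) (w^T x^(i) + b) >= 1 for all i
problem = cp.Problem(
    cp.Minimize(0.5 * cp.sum_squares(w)),
    [cp.multiply(y, X @ w + b) >= 1],
)
problem.solve()

print("w =", w.value, " b =", b.value)
print("functional margins:", y * (X @ w.value + b.value))  # all should be >= 1
```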
Now we construct the generalized Lagrangian (that allows for inequality constraints) as:
$$ \mathcal{L}(w,b,\alpha) = \frac{1}{2} \|w\|^2 - \sum_{i=1}^m \alpha_i [y^{(i)}(w^Tx^{(i)} + b) - 1] $$
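Concretely, setting the derivatives of $\mathcal{L}$ with respect to $w$ and $b$ to zero gives:
$$ \nabla_w \mathcal{L} = w - \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)} = 0 \;\;\Rightarrow\;\; w = \sum_{i=1}^m \alpha_i y^{(i)} x^{(i)}\,, \qquad \frac{\partial \mathcal{L}}{\partial b} = -\sum_{i=1}^m \alpha_i y^{(i)} = 0 $$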
Substituting these back into the Lagrangian, and using the remaining KKT conditions, we then construct this dual problem:
$$ \max_\alpha \;\;\, \mathcal{L}(\alpha) = \sum_{i=1}^m \alpha_i - \frac{1}{2} \sum_{i=1}^m \sum_{j=1}^m y^{(i)} y^{(j)} \alpha_i \alpha_j (x^{(i)})^T x^{(j)} \\ \text{s.t.} \;\;\; \alpha_i \ge 0\,, \quad i=1,\ldots,m \\ \sum_{i=1}^m \alpha_i y^{(i)} = 0 $$
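And here is the same toy example solved through this dual (again only my own sketch with `cvxpy`, not from the notes). I recover $w$ and $b$ from the $\alpha_i$ and print the products $\alpha_i\,[-y^{(i)}(w^Tx^{(i)} + b) + 1]$, which is exactly the condition I ask about below; if it holds at the optimum, they should all come out numerically zero:

```python
# Sketch of the dual problem on the same toy data as before.
# Assumes cvxpy; the quadratic term uses the identity
# sum_ij y_i y_j alpha_i alpha_j <x_i, x_j> = || sum_i alpha_i y_i x_i ||^2.
import numpy as np
import cvxpy as cp

X = np.array([[2.0, 2.0], [3.0, 3.0], [-1.0, -1.0], [-2.0, -2.5]])
y = np.array([1.0, 1.0, -1.0, -1.0])
m = X.shape[0]

alpha = cp.Variable(m)
Z = y[:, None] * X  # row i is y^(i) x^(i)

problem = cp.Problem(
    cp.Maximize(cp.sum(alpha) - 0.5 * cp.sum_squares(Z.T @ alpha)),
    [alpha >= 0, y @ alpha == 0],
)
problem.solve()

a = alpha.value
w = Z.T @ a                            # w = sum_i alpha_i y^(i) x^(i), from the stationarity condition
sv = a > 1e-6                          # support vectors: alpha_i > 0
b = float(np.mean(y[sv] - X[sv] @ w))  # from y^(i) (w^T x^(i) + b) = 1 at the support vectors

print("alpha:", a)
# complementary slackness: alpha_i * [-y^(i) (w^T x^(i) + b) + 1] should be ~0 for every i
print("slackness products:", a * (1 - y * (X @ w + b)))
```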
I can see how all of the KKT conditions are satisfied in the above problem, except one, namely the complementary slackness condition: $$ \alpha_i [-y^{(i)}(w^Tx^{(i)} + b) + 1] = 0\,, \quad i = 1,\ldots,m $$ In the course material, and everywhere else I looked on the internet, it is said that all five conditions are satisfied in the dual problem stated above. But I don't see why this condition could not be violated at the optimum. My guess is that violating it would imply the solution is not optimal, i.e. the optimal solution always satisfies it, but I don't see a way to prove this. Could somebody please shed some light on this and tell me what I'm missing? Thanks a lot.
EDIT: course materials: https://see.stanford.edu/materials/aimlcs229/cs229-notes3.pdf, pages 7-11