Unless I'm mistaken, deep neural networks are good for learning functions that are nonlinear in the input.
In such cases (for classification), the input set is not linearly separable, so the optimisation problem that results from the approximation problem is non-convex and cannot be globally solved by local optimisation. Support Vector Machines (try to) get around this by choosing features such that the projection of the input space into the feature space is linearly separable, which gives a convex optimisation problem once again.
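To make the SVM idea concrete, here is a minimal sketch (plain Python, no libraries; the data, feature map, and threshold are all illustrative choices of mine, not from any particular SVM implementation): two concentric classes are not linearly separable in the input space, but a hand-chosen feature map makes their projection into feature space separable by a single linear threshold.

```python
import math
import random

random.seed(0)

# Two classes that are NOT linearly separable in the input space:
# class 0 lies near a circle of radius 0.5, class 1 near radius 2.0.
def sample(radius, n):
    pts = []
    for _ in range(n):
        angle = random.uniform(0, 2 * math.pi)
        r = radius + random.uniform(-0.3, 0.3)
        pts.append((r * math.cos(angle), r * math.sin(angle)))
    return pts

inner = sample(0.5, 50)   # class 0
outer = sample(2.0, 50)   # class 1

# Hand-chosen feature map phi(x1, x2) = x1^2 + x2^2: the projection
# of the inputs into this 1-D feature space IS linearly separable.
def phi(p):
    return p[0] ** 2 + p[1] ** 2

# A single threshold (i.e. a linear boundary in feature space)
# now separates the two classes perfectly.
threshold = 1.5
print(all(phi(p) < threshold for p in inner))  # True
print(all(phi(p) > threshold for p in outer))  # True
```

A kernel SVM does essentially this, except the feature map is fixed in advance (implicitly, via the kernel) rather than learned from data.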
I doubt that a deep neural network always learns features that make the projection of the input set into the feature space linearly separable. So what is the point of it learning features?