3

I want to know why we only use non-linear activation functions. Why not use their linear counterparts instead?
I have read somewhere that it is the non-linearities that give the network its depth (linear functions can't do that). I don't understand this.
Can someone explain this to me and give some intuition?

Hossein
  • 2,385
  • 3
    What do you mean by the "linearities"? Do you mean linear activation functions? If so, think about what happens when you compose multiple linear functions: the result is still linear. So there is no point in having a deep linear neural network. It is only when you compose nonlinear functions that there is value added by adding multiple hidden layers. – jld Mar 07 '16 at 15:30

3 Answers

5

If you don't use non-linearities between the layers, you can only construct linear functions.

Now consider the XOR problem:

[figure: the four XOR points in the plane, with the two classes at opposite corners of the unit square]

These points cannot be separated by any linear function. You can only separate them with an MLP whose hidden layer has non-linear activation functions.

EDIT: I forgot to say that linear activation functions are usually used in the output neurons when the objective is regression rather than classification, because in regression you want an unbounded output response.
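To make this concrete, here is a minimal NumPy sketch (my own illustration, not part of the original answer): a two-unit ReLU hidden layer with hand-picked weights computes XOR exactly, while dropping the non-linearity collapses the same network into an affine function that cannot fit the four points.

```python
import numpy as np

# The four XOR points and their labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def relu(z):
    return np.maximum(0, z)

# Hand-picked weights: hidden unit 1 computes relu(x1 + x2),
# hidden unit 2 computes relu(x1 + x2 - 1).
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])
b1 = np.array([0.0, -1.0])
# The output combines them as h1 - 2*h2, which equals XOR on {0,1}^2.
W2 = np.array([1.0, -2.0])

out = relu(X @ W1 + b1) @ W2
print(out)  # [0. 1. 1. 0.] -- matches y

# Without the non-linearity the same network is just the affine map
# 2 - x1 - x2, and no affine function can produce [0, 1, 1, 0] on these points.
print((X @ W1 + b1) @ W2)  # [2. 1. 1. 0.]
```

A trained MLP would arrive at similar weights by gradient descent; the point here is only that the non-linearity between the layers is what makes the separation possible.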

4

Most successful/popular deep network architectures perform end-to-end training: we simultaneously learn a feature representation and a classifier. Typically the classifier is a linear classifier, like a softmax or an SVM classifier. So in order for these sorts of architectures to perform well, the feature-representation portion of the network, which is essentially everything except the last classification layer, must map the raw input data into a space where it is linearly separable, or at least as close to linearly separable as possible.

In general, it need not be this way: you could have a nonlinear classifier on linear features, a nonlinear feature representation with a nonlinear classifier, and so on. However, there is a very large emphasis on learning feature representations, and empirically we have seen many breakthroughs on various benchmark tasks (ImageNet, Microsoft COCO, Pascal). It appears that investing most of the computational budget in learning good feature representations has been very effective and is, as of now, standard practice.

For many learning tasks, particularly those in the "AI realm" (natural language processing, computer vision, speech recognition, etc.), the raw features are "highly entangled" and in particular cannot be separated by a linear transformation.

As @Net_Raider mentioned in their answer, a composition of linear maps is itself a linear map, so there is no real notion of "depth" in a purely linear network. In the above context: by using only linear layers, we can only learn linear transformations. For complex data, this is simply not sufficient to linearly separate the data and learn good features, which is a problem in the current "standard" supervised deep-net architecture, since we ultimately learn a linear classifier on top of the learned features.

Using non-linearities gives us a real notion of depth and lets us learn non-linear transformations: we can learn a non-linear transformation that maps our raw input features into a space where they are linearly separable.
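To illustrate this division of labour, here is a minimal NumPy sketch (my own illustration; the layer sizes and random weights are arbitrary placeholders, not learned): a stack of non-linear layers producing a feature representation, with a single linear softmax classifier on top.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(z):
    return np.maximum(0, z)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

# Fake batch of raw inputs: 8 examples with 10 raw features each (arbitrary sizes).
X = rng.normal(size=(8, 10))

# "Feature representation" part: two non-linear layers.
# Here the weights are random; in a real network they are learned end-to-end.
W1, b1 = rng.normal(size=(10, 32)), np.zeros(32)
W2, b2 = rng.normal(size=(32, 16)), np.zeros(16)
features = relu(relu(X @ W1 + b1) @ W2 + b2)

# "Classifier" part: a single linear (softmax) layer on top of the features.
W3, b3 = rng.normal(size=(16, 3)), np.zeros(3)
probs = softmax(features @ W3 + b3)

print(probs.shape)        # (8, 3): class probabilities for each example
print(probs.sum(axis=1))  # each row sums to 1
```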

keramat
  • 211
Indie AI
  • 7,002
  • do you know anything about this? http://stats.stackexchange.com/questions/208817/what-are-the-examples-where-parameter-sharing-make-no-sense-in-convolution-netwo – Hossein Apr 22 '16 at 17:54
0

Consider a linear node $i$ in network layer $k$: $$a_i^k=\sum_j w_{ij}^k x_j^k,$$ where $x_j^k$ are the inputs, $a_i^k$ is the output, and $w_{ij}^k$ are the weights. This can be written in matrix form: $$A^k=W^kX^k$$

In the next layer, the outputs become the inputs, $x_j^{k+1}=a_j^k$, and its outputs are $$a_i^{k+1}=\sum_j w_{ij}^{k+1}a_j^k = \sum_j\sum_m w_{ij}^{k+1} w_{jm}^k x_m^k,$$ or in matrix form $$A^{k+1}=W^{k+1}X^{k+1}=W^{k+1}A^k=(W^{k+1}W^k)X^k$$

As you can see, all that the next layer accomplishes is to multiply the two weight matrices together: the final output is still a linear combination of the inputs to the first layer.

That is why a purely linear network will never be deep in any meaningful sense: it cannot model nonlinear processes.
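A quick NumPy check of this collapse (my own illustration, with arbitrary shapes): applying two linear layers in sequence is numerically identical to a single layer whose weight matrix is $W^{k+1}W^k$.

```python
import numpy as np

rng = np.random.default_rng(0)

Xk  = rng.normal(size=(5, 4))    # inputs to layer k (5 features, 4 examples as columns)
Wk  = rng.normal(size=(8, 5))    # weights of layer k
Wk1 = rng.normal(size=(3, 8))    # weights of layer k+1

two_layers = Wk1 @ (Wk @ Xk)     # A^{k+1} computed layer by layer
one_layer  = (Wk1 @ Wk) @ Xk     # a single layer with the combined weight matrix

print(np.allclose(two_layers, one_layer))  # True
```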

Aksakal
  • 61,310