
On the Wikipedia page on convolutional neural networks, it is stated that rectified linear units are applied to increase the non-linearity of the decision function and of the overall network: https://en.wikipedia.org/wiki/Convolutional_neural_network#ReLU_layer

Why is increasing non-linearity desired? What effect does it have on the overall performance of the model?

6 Answers


That part of the Wikipedia article leaves a bit to be desired. Let's separate two aspects:

The need for nonlinear activation functions

It's easy to show that a feedforward neural network with linear activation functions and $n$ layers, each with $m$ hidden units (a linear neural network, for brevity), is equivalent to a linear neural network without hidden layers. Proof:

$$ y = h(\mathbf{x})=\mathbf{b}_n+W_n(\mathbf{b}_{n-1}+W_{n-1}(\dots (\mathbf{b}_1+W_1 \mathbf{x})\dots))=\mathbf{b}_n+W_n\mathbf{b}_{n-1}+W_nW_{n-1}\mathbf{b}_{n-2}+\dots+W_nW_{n-1}\dots W_1\mathbf{x}=\mathbf{b}'+W'\mathbf{x}$$

Thus it's clear that adding layers ("going deep") doesn't increase the approximation power of a linear neural network at all, unlike for nonlinear neural networks.
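
To see the collapse concretely, here is a minimal NumPy sketch (the layer sizes and weights are arbitrary, chosen only for illustration) that checks the identity $y=\mathbf{b}'+W'\mathbf{x}$ numerically:

```python
import numpy as np

rng = np.random.default_rng(0)

# Three stacked "linear layers" with arbitrary (made-up) sizes and random weights.
W1, b1 = rng.normal(size=(5, 4)), rng.normal(size=5)
W2, b2 = rng.normal(size=(6, 5)), rng.normal(size=6)
W3, b3 = rng.normal(size=(3, 6)), rng.normal(size=3)

def deep_linear(x):
    """Forward pass through the three linear (identity-activation) layers."""
    return b3 + W3 @ (b2 + W2 @ (b1 + W1 @ x))

# Collapse the stack into a single affine map y = b' + W' x, as in the proof above.
W_prime = W3 @ W2 @ W1
b_prime = b3 + W3 @ b2 + W3 @ W2 @ b1

x = rng.normal(size=4)
print(np.allclose(deep_linear(x), b_prime + W_prime @ x))  # True
```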

Also, nonlinear activation functions are needed for the universal approximation theorem for neural networks to hold. This theorem states that, under certain conditions, for any continuous function $f:[0,1]^d\to\mathbb{R}$ and any $\epsilon>0$, there exists a neural network with one hidden layer and a sufficiently large number of hidden units $m$ which approximates $f$ uniformly on $[0,1]^d$ to within $\epsilon$. One of the conditions for the theorem to hold is that the neural network uses nonlinear activation functions: if only linear functions are used, the theorem is no longer valid. Thus we know that there exist continuous functions over hypercubes which we simply cannot approximate accurately with linear neural networks.
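
For concreteness, one common way to state the theorem (roughly following Cybenko and Hornik; the exact conditions on the activation $\sigma$ differ between versions) is that for every $\epsilon>0$ there exist $m$, weights $\mathbf{w}_i\in\mathbb{R}^d$ and scalars $v_i,b_i$ such that

$$\sup_{\mathbf{x}\in[0,1]^d}\left|f(\mathbf{x})-\sum_{i=1}^{m} v_i\,\sigma\!\left(\mathbf{w}_i^\top\mathbf{x}+b_i\right)\right|<\epsilon,$$

provided $\sigma$ is a suitable nonlinear function (e.g., non-polynomial, in the version by Leshno et al.). If $\sigma$ is affine, the sum is itself an affine function of $\mathbf{x}$, so the bound cannot hold for every continuous $f$.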

You can see the limits of linear neural networks in practice thanks to the TensorFlow Playground. I built a linear neural network with 4 hidden layers for classification. As you can see, no matter how many layers you use, the linear neural network can only find linear separation boundaries, since it's equivalent to a linear neural network without hidden layers, i.e., to a linear classifier.
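
If you prefer code to the Playground, here is a rough scikit-learn sketch of the same kind of experiment (the dataset, layer widths and other settings are my own choices, not the Playground's): with identity (linear) activations a deep network is stuck with a linear boundary, while the same architecture with ReLU separates the classes.

```python
from sklearn.datasets import make_circles
from sklearn.neural_network import MLPClassifier

# Two concentric rings: a dataset that is not linearly separable.
X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

for act in ["identity", "relu"]:  # "identity" = purely linear activations
    clf = MLPClassifier(hidden_layer_sizes=(8, 8, 8, 8), activation=act,
                        max_iter=2000, random_state=0)
    clf.fit(X, y)
    print(act, clf.score(X, y))
# Typically the identity network stays near chance-level accuracy (its decision
# boundary is a straight line), while the ReLU network fits the rings almost perfectly.
```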

The need for ReLU

The activation function $h(s)=\max(0,cs)$ (with $c>0$; the standard ReLU corresponds to $c=1$) is not used because "it increases the nonlinearity of the decision function": whatever that may mean, ReLU is no more nonlinear than $\tanh$, sigmoid, etc. The actual reason why it's used is that, when stacking more and more layers in a CNN, it has been empirically observed that a CNN with ReLU is much easier and faster to train than a CNN with $\tanh$ (the situation with a sigmoid is even worse). Why is that so? There are currently two main theories:

  • $\tanh(s)$ has the vanishing gradient problem. As the independent variable $s$ goes to $\pm \infty$, the derivative of $\tanh(s)$ goes to 0:

[figure: plot of $\tanh(s)$, whose derivative approaches 0 for large $|s|$]

This means that as more layers are stacked, the gradients get smaller and smaller. Since the step in weight space taken by the backpropagation algorithm is proportional to the magnitude of the gradient, vanishing gradients mean that the neural network effectively cannot be trained anymore. This manifests itself in training times which increase exponentially with the number of layers. By contrast, the derivative of ReLU is constant (equal to $c$) whenever $s>0$, no matter how many layers we stack (it's also equal to 0 if $s<0$, which leads to the "dead neuron" issue, but that is another problem). A numerical sketch comparing the two is given after this list.

  • There are theorems which guarantee that local minima, under certain conditions, are global minima (see here). Some of the assumptions of these theorems don't hold if the activation function is a $\tanh$ or sigmoid, but they hold if the activation function is a ReLU.
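
To make the contrast with $\tanh$ concrete, here is a small NumPy sketch (random, untrained weights; the depth, width and He-style scaling are my own choices) that back-propagates a unit gradient through a deep stack of $\tanh$ layers and through a deep stack of ReLU layers, and compares the gradient norms that reach the input:

```python
import numpy as np

rng = np.random.default_rng(0)
depth, width = 30, 64  # arbitrary choices, deep enough to show the effect

def input_gradient_norm(act, d_act):
    """Back-propagate a unit gradient through `depth` random (untrained) layers
    and return the norm of the gradient that reaches the input."""
    x = rng.normal(size=width)
    Ws, pre_acts = [], []
    for _ in range(depth):
        W = rng.normal(size=(width, width)) * np.sqrt(2.0 / width)  # He-style scale
        s = W @ x
        Ws.append(W)
        pre_acts.append(s)
        x = act(s)
    g = np.ones(width)  # pretend dLoss/dOutput = 1 for every output unit
    for W, s in zip(reversed(Ws), reversed(pre_acts)):
        g = W.T @ (g * d_act(s))  # chain rule through one layer
    return np.linalg.norm(g)

tanh, d_tanh = np.tanh, lambda s: 1.0 - np.tanh(s) ** 2
relu, d_relu = lambda s: np.maximum(0.0, s), lambda s: (s > 0).astype(float)

print("tanh:", input_gradient_norm(tanh, d_tanh))
print("relu:", input_gradient_norm(relu, d_relu))
# The tanh gradient is usually smaller by several orders of magnitude.
```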
DeltaIV
  • +1 for "since it's equivalent to a linear neural network without hidden layers, i.e., to a linear classifier." as a beautiful summary of the proof above – 3yanlis1bos Sep 10 '19 at 01:56

I'll give you a very loose analogy (the emphasis is important here) that may help you understand the intuition. There's a technical drawing tool called a French curve; here's an example:

[image: a French curve template]

We were trained to use it in high school, in a technical drawing class. These days the same class is taught with CAD software, so you may not have encountered one. See how to use them in this video.

Here's a straight ruler:

[image: a straight ruler (source: officeworks.com.au)]

Can you draw a curved line with a straight ruler? Of course, you can! However, it's more work. Take a look at this video to appreciate the difference.

It's more efficient to draw curved lines with a French curve than with a straight ruler: you'd have to make a lot of small straight segments to draft any smooth curve with the latter.

It's not exactly the same in machine learning, but this analogy gives you an intuition for why nonlinear activations may work better in many cases: your problems are nonlinear, and nonlinear pieces can be more efficient to combine into a solution to a nonlinear problem.
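
If it helps, here is a tiny NumPy sketch, entirely my own construction, of the analogy: it "drafts" a curve once with a handful of ReLU pieces combined by least squares (the French-curve-like tool) and once with a single straight line (the ruler).

```python
import numpy as np

# A made-up 1-D "curve to draft": y = sin(x) on [0, 2*pi].
x = np.linspace(0, 2 * np.pi, 200)
y = np.sin(x)

# "French curve" style: combine a handful of ReLU pieces (hand-picked kink
# locations, weights fitted by least squares) into one flexible drawing tool.
kinks = np.linspace(0, 2 * np.pi, 8)
relu_basis = np.maximum(0.0, x[:, None] - kinks[None, :])
A = np.column_stack([np.ones_like(x), x, relu_basis])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
relu_fit = A @ coef

# "Straight ruler" style: a single straight line fitted to the same curve.
B = np.column_stack([np.ones_like(x), x])
line_coef, *_ = np.linalg.lstsq(B, y, rcond=None)
line_fit = B @ line_coef

print("max error with 8 ReLU pieces:", np.max(np.abs(y - relu_fit)))
print("max error with a straight line:", np.max(np.abs(y - line_fit)))
# The few nonlinear pieces follow the curve closely; the single line cannot.
```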

Aksakal
  • Liked your example, +1. But it assumes we want to draw a curve, not a line. BTW, do you mind making the ruler smaller? – Haitao Du Mar 21 '18 at 18:51
  • @hxd1011, for a linear model, a "line", there's no need for ML; a simple linear regression will do – Aksakal Mar 21 '18 at 18:52
  • Yes, but many people think linear regression and logistic regression are "machine learning" – Haitao Du Mar 21 '18 at 18:53
  • @hxd1011, they teach both in ML classes, because the layers in ML are very similar to regression in how they turn inputs into outputs. – Aksakal Mar 21 '18 at 18:57
  • It's a bit more than that. 1) An arbitrarily deep neural network with linear activation functions (also called a linear neural network) is equivalent to a linear neural network without hidden layers. So adding "a lot more layers" ("going deep") doesn't help at all with the approximation power of the linear neural network. – DeltaIV Mar 21 '18 at 19:00
  • 2) Adding neurons ("going wide") helps only so much. The universal approximation theorem of neural networks doesn't hold for linear neural networks, so we know that there exist continuous functions over hypercubes which we will never be able to approximate to the desired accuracy using a linear neural network, no matter how many layers and/or units we add. – DeltaIV Mar 21 '18 at 19:00
  • +1 I know there will be technical objections to this answer, yet the image is memorable and more than compensates for the fact it's an analogy. – whuber Mar 21 '18 at 19:07
  • I'm not sure about this analogy (or maybe I'm misinterpreting it). It seems to me that the ReLU activation function the OP is asking about is more like using a ruler to approximate a curve with small, straight segments, and less like the French curve (because taking linear combinations and compositions of them as in neural nets gives piecewise linear functions). The important thing then would be the fact that you're allowed to make curves that aren't straight overall, rather than the pieces you compose them out of, no? – user20160 Mar 22 '18 at 09:18
  • @user20160, the fact that ReLU is piecewise linear is not important in this discussion. What's important is that it's not linear. The fact that it has linear pieces plays a role in terms of computational speed, and subsequently optimization efficiency. – Aksakal Mar 22 '18 at 13:12