
I read here the following:

  • Sigmoid outputs are not zero-centered. This is undesirable since neurons in later layers of processing in a Neural Network (more on this soon) would be receiving data that is not zero-centered. This has implications on the dynamics during gradient descent, because if the data coming into a neuron is always positive (e.g. $x > 0$ elementwise in $f = w^Tx + b$), then the gradient on the weights $w$ will during backpropagation become either all positive or all negative (depending on the gradient of the whole expression $f$). This could introduce undesirable zig-zagging dynamics in the gradient updates for the weights. However, notice that once these gradients are added up across a batch of data the final update for the weights can have variable signs, somewhat mitigating this issue. Therefore, this is an inconvenience but it has less severe consequences compared to the saturated activation problem above.

Why would having all $x>0$ (elementwise) lead to all-positive or all-negative gradients on $w$?


1 Answer


$$f=\sum w_ix_i+b$$ $$\frac{df}{dw_i}=x_i$$ $$\frac{dL}{dw_i}=\frac{dL}{df}\frac{df}{dw_i}=\frac{dL}{df}x_i$$

Because $x_i>0$, every gradient $\dfrac{dL}{dw_i}$ has the same sign as $\dfrac{dL}{df}$, so they are either all positive or all negative.
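
A minimal numeric sketch of this (my own illustration with made-up values, assuming NumPy is available):

```python
import numpy as np

rng = np.random.default_rng(0)

x = rng.uniform(0.1, 1.0, size=5)   # all-positive inputs, e.g. sigmoid outputs
dL_df = -0.7                        # some upstream scalar gradient dL/df

dL_dw = dL_df * x                   # backprop through f = w.x + b: dL/dw_i = dL/df * x_i
print(np.sign(dL_dw))               # every component shares the sign of dL_df
```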

Update
Say there are two parameters $w_1$ and $w_2$. If the gradients in the two dimensions always have the same sign (i.e., either both positive or both negative), it means we can only move roughly toward the northeast or the southwest in the parameter space.

If our goal happens to be in the northwest, we can only move in a zig-zagging fashion to get there, just like parallel parking in a narrow space. (forgive my drawing)

[figure: zig-zag path of same-sign updates toward a goal in the northwest]
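
Here is a small toy simulation of that picture (my own sketch, not part of the original answer; all values are made up): a single linear neuron $f = w^\top x$ (bias dropped for simplicity) trained sample-by-sample with squared error, with all features positive and the optimum placed to the northwest of the starting point.

```python
import numpy as np

x = np.array([[1.0, 0.2],
              [0.2, 1.0]])           # two samples, all features positive
w_star = np.array([0.0, 2.0])        # optimum sits "northwest" of the start
t = x @ w_star                       # targets consistent with w_star

w, lr = np.array([1.0, 1.0]), 0.4
for step in range(8):
    i = step % 2                     # take one sample at a time (SGD)
    grad = (x[i] @ w - t[i]) * x[i]  # dL/dw for squared error on sample i
    w = w - lr * grad
    print(step, np.sign(-lr * grad), w.round(3))
# each printed update direction is (+, +) or (-, -), yet w drifts toward [0, 2]
```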

Therefore activation functions whose outputs are all of one sign (ReLU, sigmoid) can make gradient-based optimization harder. To mitigate this we can zero-center the data in advance, as in batch/layer normalization.
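
As a sanity check on the zero-centering point (again my own sketch, reusing the numbers above, with feature-wise mean subtraction as a crude stand-in for batch/layer normalization): once the features are centered, a single per-sample gradient can already mix signs, so the update direction is no longer confined to the northeast/southwest quadrants.

```python
import numpy as np

x = np.array([[1.0, 0.2],
              [0.2, 1.0]])
x_c = x - x.mean(axis=0)             # zero-center each feature
w, t = np.array([1.0, 1.0]), np.array([0.4, 2.0])

grad = (x_c[0] @ w - t[0]) * x_c[0]  # same squared-error gradient as above
print(np.sign(grad))                 # components now have different signs
```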

Another solution I can think of is to add a bias term for each input, so the layer becomes $$f=\sum w_i(x_i+b_i).$$ The gradient is then $$\frac{dL}{dw_i}=\frac{dL}{df}(x_i+b_i),$$ whose sign no longer depends solely on $x_i$.
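
A quick sketch of this per-input-bias variant (my own code, illustrative values only): with $f=\sum_i w_i(x_i+b_i)$, the factor multiplying $dL/df$ is $x_i+b_i$, which can be negative even when every $x_i$ is positive.

```python
import numpy as np

x = np.array([0.9, 0.1, 0.6])        # all-positive inputs
b = np.array([-0.5, 0.2, -0.8])      # learned per-input biases (illustrative)
dL_df = 1.3                          # some upstream scalar gradient

dL_dw = dL_df * (x + b)              # dL/dw_i = dL/df * (x_i + b_i)
print(np.sign(dL_dw))                # signs can now differ across components
```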

– dontloo
  • Please correct me if I am wrong, but shouldn't the value of dL/df be the transpose of x, i.e. x.T, since we would be using the idea of the Jacobian here? – chinmay Feb 11 '18 at 14:54
  • @chinmay sorry for the late reply, I think $f$ here is the outcome of $w^Tx+b$ so the value of dL/df does not depend on x, and usually $L$ is a scalar, $w$ and $x$ are 1d vectors, so dL/df should also be a scalar, right? – dontloo Feb 23 '18 at 05:47
  • Yes, it is a big typo from my end. I meant df/dw... but I think it depends more on the vector x and whether it is a row vector or a column vector – chinmay Mar 28 '18 at 15:46
  • @dontloo sorry for the very late reply, but what is the problem with the gradients having the same sign as $dL/df$? Why is that a bad thing? – floyd Jul 31 '19 at 19:30
  • @floyd hi I just added some updates for your question – dontloo Aug 01 '19 at 10:31
  • Doesn't the argument work only for a specific case (as in the picture)? If the source is at the top right and the target is at the bottom left (or vice versa), then we will not have zig-zag dynamics, right? I could not understand how we are generalizing here. – Vinay Feb 25 '20 at 11:22
  • @Vinay yes, I don't think it is a broadly applicable case either; I'm not an expert on optimization methods though – dontloo Feb 25 '20 at 22:27
  • I do not think this reasoning applies to ReLU. ReLU is generally easier to train than tanh in RNNs; remember that its derivative is either 0 or 1. Assuming that the output layer has no activation and the input is normalized to $[-1,1]$, the zigzag pattern, if it happens, should not be due to ReLU. – Minh Khôi May 30 '22 at 11:47
  • @dontloo why don't we shift the sigmoid to make it zero-centered? Subtract 0.5 from the sigmoid and it will become zero-centered. – Ritwik Jun 06 '22 at 15:54
  • @MinhKhôi batch normalisation effectively does the zero-centring for you. The fact that ReLU is better than sigmoid or tanh is not because of zero mean, but that doesn't mean that ReLU doesn't benefit from centring. – seanv507 Aug 09 '23 at 09:30
  • I don't think a non-zero-centered activation function is a major problem in backpropagation.

    The problem is most severe only when $w$ is initialized all positive and $x>0$. And while the zig-zag pattern happens initially, when you get close to a local minimum, dying ReLU might actually help reduce it.

    Moreover, unless you're applying ReLU after every layer (it's common not to apply an activation to the output layer, right before the loss function), you should be fine. For example, ReLU is actually very common in RNNs.

    – Minh Khôi Aug 14 '23 at 05:13
  • Correct me if I am wrong, but the zig-zag problem applies only to the special case where no other hidden layers follow the neuron of interest. For example, if a hidden layer follows the neuron of interest, then $dL/df$ is a vector and, as such, there might be both negative and positive contributions to the gradient, i.e. $dL/dw_1$ might be negative while $dL/dw_2$ might be positive, etc. – ado sar Dec 27 '23 at 15:19
  • @adosar yes, it's possible that the weights receive gradients from other paths, and it depends on the optimizer as well; overall I don't think it's a big problem in neural network training – dontloo Dec 27 '23 at 23:01