
In the cs231n course, it is mentioned that

If the initial weights are too large then most neurons would become saturated and the network will barely learn.

How do the neurons get saturated? Large weights may lead to a z (the sigmoid output) which is not very close to 0 or 1, and so z*(1-z) would not saturate.

MysticForce

1 Answer


The sigmoid function $$ \theta(z) = \frac{1}{1+e^{-z}}$$ looks like this:

[Plot of the sigmoid function: an S-shaped curve that rises from 0 to 1 and flattens out in both tails.]

where $$z=\sum_i w_i a_i + \text{bias}$$ for activations $a_i$ from the previous layer, and weights $w_i$ of the current neuron.

When the weights $w_i$ are too large (positive or negative), $z$ tends to be large in magnitude as well, driving the output of the sigmoid to the far left (values near 0) or far right (values near 1). These are saturation regions, where the gradient/derivative is very small, which slows down learning.
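
As a quick numerical check (a NumPy sketch I am adding for illustration, not part of the course material), the derivative $\theta(z)\,(1-\theta(z))$ peaks at 0.25 for z=0 and collapses once |z| grows into the saturation region:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)   # theta(z) * (1 - theta(z))

# The derivative is 0.25 at z=0, ~0.0066 at z=5, and vanishes for larger |z|.
for z in [0.0, 2.0, 5.0, 10.0, 50.0]:
    print(f"z={z:5.1f}  sigmoid={sigmoid(z):.6f}  grad={sigmoid_grad(z):.2e}")
```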

Learning slows down when the gradient is small, because the weight update of the network at each iteration is directly proportional to the gradient magnitude.
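
To see that proportionality concretely, here is a rough sketch (my own illustration with made-up numbers and a squared-error loss, not anything from the course) of one SGD step for a single sigmoid neuron whose weights are already large: the update is scaled by the sigmoid derivative, so almost nothing changes even though the output is far from the target:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Toy single sigmoid neuron, assumed values chosen so that z is saturated.
a = np.array([1.0, -0.5, 0.25])   # activations from the previous layer
w = np.array([8.0, -6.0, 4.0])    # deliberately large weights
b = 0.0
target = 0.0
lr = 0.1

z = w @ a + b                      # z = 12, deep in the saturated region
out = sigmoid(z)
# For L = 0.5*(out - target)^2:  dL/dw = (out - target) * sigmoid'(z) * a
grad_w = (out - target) * out * (1.0 - out) * a

print("sigmoid'(z) =", out * (1.0 - out))   # tiny factor from saturation
print("weight change:", -lr * grad_w)       # update is proportionally tiny
```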

MohamedEzz
  • "When the weights wi are too large (positive or negative), z tends to be large as well". Even if the weights are large, isn't there a fair chance that z could be small (because of the $a_i$)? For instance, if w1=1000, w2=1000, a1=1 and a2=-0.99, then z=10 – MysticForce Feb 12 '17 at 11:17
  • I understand the problem here now. For the layers deep down in a neural network, the $a_i$'s are positive because the sigmoid in the previous layer gives a positive output. This forces z to be large. So this problem is not solely about the weights but also about the positive output of the sigmoid. Thanks for the insight – MysticForce Feb 12 '17 at 11:21
  • As a continuation of the discussion, we could also solve this by shifting the mean of the sigmoid to zero, using z-0.5 as the output of the neuron. Why isn't this done usually? – MysticForce Feb 12 '17 at 11:22
  • It is solely about the weights. The problem is not about the sigmoid mean; it's about the sigmoid limits at inf and -inf. It's true that the outputs of the sigmoid are always positive, but making it emit +ve and -ve outputs would not help: still, a very large weight (+ve or -ve) multiplied by an activation (+ve or -ve, small or large) will produce a large number (+ve or -ve)... this goes through the sigmoid, producing an output in a "flat" area with a tiny gradient. – MohamedEzz Feb 12 '17 at 13:50
  • What does saturation mean in this context? I always see people explaining this phenomenon using the word saturate itself. Even when the question at hand is asking what people mean by saturate. – Nick Corona Feb 09 '20 at 16:24
  • A variable "saturating" means it is slowly approaching its maximum (or minimum) possible value. From the plot above, you see the sigmoid has min=0 and max=1. It comes close to "saturation" when z is below -5 or above 5. The issue is that this "saturation" behavior (i.e., slow increase/decrease) means that the gradient is small. Note that the gradient is "the rate of change"; saying "the rate of change is slow" is equivalent to saying "the gradient value is small". – MohamedEzz Feb 11 '20 at 05:13
  • I have read the consequence of a zero gradient being described as 'prevent the gradient from flowing backward, and prevent the lower layers from learning useful features' (Glorot & Bengio, 2010). Does it mean the saturated activation function produces a zero gradient w.r.t. its input, and thus a zero error-function gradient, so the updating of the weights will halt? If we are using minibatch learning, new inputs into the model do not update the parameters anymore. Is that what it means by 'prevent from learning useful features'? Thanks in advance. – siegfried Aug 23 '21 at 05:38
  • Correct. To be precise, it will not be exactly zero, but the gradient becomes very small, especially in the early layers of deeper nets. – MohamedEzz Aug 23 '21 at 13:52
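
Following up on those last two comments, here is a minimal sketch (my own addition, not from the thread; the per-layer pre-activations are assumed values, and the weight factors are ignored to keep it simple) of how backpropagation multiplies in one sigmoid-derivative factor per layer, so the gradient reaching the early layers of a deep net with saturated units becomes vanishingly small:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed pre-activations, one per layer, all in the saturated region (|z| >= 5).
zs = [6.0, -7.0, 5.5, -8.0, 6.5]

grad = 1.0   # pretend the gradient arriving at the top layer is 1
for layer, z in enumerate(reversed(zs), start=1):
    grad *= sigmoid_grad(z)   # backprop multiplies in one sigmoid'(z) per layer
    print(f"after {layer} saturated layer(s): gradient factor = {grad:.3e}")
```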