The question is very simple, yet I can't find a quick confirmation on the web. It might seem obvious: by design, will dropout always make gradient descent look stochastic, as in SGD?
I've built a system that converges nicely with momentum at 0 and a learning rate of 0.01, even with 100 layers stacked. With dropout, the error still decreases overall, but it jumps up and down. Is that because dropout knocks out different neurons on each pass?
Is it usual to see the error fluctuate during backprop with dropout, similar to what you would see with SGD?
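To make concrete what I'm seeing, here is a minimal sketch (PyTorch, with a toy model and random data standing in for my actual network) of why I suspect the masks are the cause: with dropout in train mode, evaluating the loss on the exact same full batch gives a different value every time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for my network: the model, sizes, and data are made up.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # a fresh random mask is drawn on every forward pass
    nn.Linear(64, 1),
)
criterion = nn.MSELoss()

x = torch.randn(256, 10)  # one fixed "full batch"
y = torch.randn(256, 1)

model.train()  # dropout active
print([round(criterion(model(x), y).item(), 4) for _ in range(3)])  # three different losses

model.eval()   # dropout off (identity at eval time, since PyTorch uses inverted dropout)
print([round(criterion(model(x), y).item(), 4) for _ in range(3)])  # identical losses
```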
Would that imply I can train on huge batches instead of minibatches, without the fear of overfitting by default? For example, using Resilient Propagation (Rprop): in my test example, the LSTM trains in just 40 iterations with Rprop, instead of 1000 with SGD.
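This is roughly the kind of full-batch Rprop loop I have in mind (again a sketch with a made-up toy model and data rather than my actual LSTM; Rprop only uses the sign of the gradient, which is why it is normally paired with full-batch training):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the LSTM; architecture, sizes, and data are illustrative only.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01)

x = torch.randn(1024, 10)  # the whole training set as a single batch
y = torch.randn(1024, 1)

model.train()
for it in range(40):               # ~40 full-batch iterations
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # still noisy per iteration because of the dropout masks
    loss.backward()
    optimizer.step()
    print(it, loss.item())
```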