The question is very simple, yet I can't find a quick confirmation on the web. It might seem obvious: by design, will dropout always make gradient descent look stochastic, as in SGD?
I've built a system that converges nicely with momentum at 0 and a learning rate of 0.01, even with 100 layers stacked. With dropout, the error still decreases overall, but it jumps up and down. Is that because dropout knocks out different neurons on each pass?
Is it usual to see the error fluctuate during backprop with dropout, similar to what you would see with SGD?
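To make concrete what I'm seeing, here is a minimal sketch (PyTorch, with a toy model and random data standing in for my actual network) of why I suspect the masks are the cause: with dropout in train mode, evaluating the loss on the exact same full batch gives a different value every time.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for my network: the model, sizes, and data are made up.
model = nn.Sequential(
    nn.Linear(10, 64),
    nn.ReLU(),
    nn.Dropout(p=0.5),  # a fresh random mask is drawn on every forward pass
    nn.Linear(64, 1),
)
criterion = nn.MSELoss()

x = torch.randn(256, 10)  # one fixed "full batch"
y = torch.randn(256, 1)

model.train()  # dropout active
print([round(criterion(model(x), y).item(), 4) for _ in range(3)])  # three different losses

model.eval()   # dropout off (identity at eval time, since PyTorch uses inverted dropout)
print([round(criterion(model(x), y).item(), 4) for _ in range(3)])  # identical losses
```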
Would that imply I can train on huge batches instead of minibatches, without the fear of overfitting by default? For example, using Resilient Propagation (Rprop): in my test example, the LSTM trains in just 40 iterations with Rprop, instead of 1000 with SGD.
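This is roughly the kind of full-batch Rprop loop I have in mind (again a sketch with a made-up toy model and data rather than my actual LSTM; Rprop only uses the sign of the gradient, which is why it is normally paired with full-batch training):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for the LSTM; architecture, sizes, and data are illustrative only.
model = nn.Sequential(nn.Linear(10, 64), nn.ReLU(), nn.Dropout(0.5), nn.Linear(64, 1))
criterion = nn.MSELoss()
optimizer = torch.optim.Rprop(model.parameters(), lr=0.01)

x = torch.randn(1024, 10)  # the whole training set as a single batch
y = torch.randn(1024, 1)

model.train()
for it in range(40):               # ~40 full-batch iterations
    optimizer.zero_grad()
    loss = criterion(model(x), y)  # still noisy per iteration because of the dropout masks
    loss.backward()
    optimizer.step()
    print(it, loss.item())
```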