Very deep models involve the composition of several functions or layers. The gradient tells how to update each parameter, under the assumption that the other layers do not change. In practice, we update all of the layers simultaneously.
— Page 313, Deep Learning, 2016.
Do we violate this assumption in practice? If so, what are the consequences of that violation? One consequence is that we cannot guarantee that updating all of the parameters in a single step will move us in the direction of steepest descent, even if we have superb data in great amounts, the loss function is convex, and the single gradient step is calculated over all of the samples. Is that correct? And is the reason that simultaneously updating all of the parameters does not take into account their dependence on each other?
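As a concrete illustration of the interaction the quote describes, here is a minimal sketch using a toy linear model with one weight per layer (similar in spirit to the one-unit-per-layer example discussed in that section of the book). The depth, weights, learning rate, and target below are made-up values chosen only for illustration; the point is to compare the loss decrease predicted by the per-layer gradients with the decrease actually obtained when every layer is updated at once.

```python
import numpy as np

# Toy deep linear model with one weight per layer: y_hat = x * w1 * w2 * ... * w10.
# Each layer's gradient is computed with the other layers held fixed, but the
# update is applied to every layer simultaneously. All values are illustrative.

x, target = 1.0, 0.0
weights = np.full(10, 1.1)   # ten "layers", one weight each
lr = 0.01

def forward(w):
    return x * np.prod(w)

def loss(w):
    return 0.5 * (forward(w) - target) ** 2

y_hat = forward(weights)
error = y_hat - target
# dL/dw_i = error * x * (product of the other weights) = error * y_hat / w_i
grads = error * y_hat / weights

# Decrease in loss predicted by the gradients alone (first-order view,
# i.e. each weight treated as if the other layers did not change).
predicted_drop = lr * np.sum(grads ** 2)

# Decrease actually obtained when every layer is updated at the same time.
actual_drop = loss(weights) - loss(weights - lr * grads)

print(f"loss before the step : {loss(weights):.3f}")   # ~3.364
print(f"predicted decrease   : {predicted_drop:.3f}")  # ~3.740
print(f"actual decrease      : {actual_drop:.3f}")     # ~2.292
```

In this run the per-layer gradients predict a decrease larger than the entire current loss, which is impossible. The gap comes from the second- and higher-order terms that appear when many composed layers change at the same time, which is exactly the dependence between layers that the individually computed gradients ignore.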