According to (Kushner & Yin, sec. 1.1.3), stochastic approximation and stochastic gradient descent are the same thing (emphasis mine):
> ...stochastic approximation form of linearized least squares:
> $$\theta_{n+1} = \theta_n + \epsilon_n \phi_n [y_n - \phi_n' \theta_n] \quad (1.18)$$
>
> ...The mean ODE, which characterizes the asymptotic behavior of (1.18), is
> $$\dot{\theta} = -\frac12 \frac{\partial}{\partial \theta} \mathbb{E}[y_n - \phi_n' \theta]^2, \quad (1.19)$$
> which is asymptotically stable about the optimal point $\overline{\theta}$.
>
> The key to the value of the stochastic approximation algorithm is the representation of the right side of (1.19) as the negative gradient of the cost function. This emphasizes that, whatever the origin of the stochastic approximation algorithm, it can be interpreted as a "stochastic" gradient descent algorithm. For example, (1.18) can be interpreted as a "noisy" gradient procedure.
>
> ...the driving observation in (1.18) is just a "noise-corrupted" value of the desired gradient at $\theta=\theta_n$.
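The recursion (1.18) is easy to illustrate. Below is a minimal Python sketch on synthetic data; the regressor distribution, noise level, and step-size schedule $\epsilon_n = 1/(n+10)$ are my own assumptions, chosen to satisfy the usual diminishing-step-size conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression setup: y_n = phi_n' theta_true + noise.
theta_true = np.array([2.0, -1.0])
theta = np.zeros(2)  # theta_1

for n in range(1, 20001):
    phi = rng.normal(size=2)                       # regressor phi_n
    y = phi @ theta_true + 0.1 * rng.normal()      # noisy observation y_n
    eps = 1.0 / (n + 10)                           # step size epsilon_n (assumed schedule)
    theta = theta + eps * phi * (y - phi @ theta)  # recursion (1.18)

print(theta)  # iterates settle near theta_true
```

With a diminishing step size, the iterates settle near the optimal point $\overline{\theta}$, consistent with the asymptotic stability of the mean ODE (1.19).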
According to (Bottou et al., sec. 3.2):
> The prototypical stochastic optimization method is the stochastic gradient method (SG) [130], which, in the context of minimizing $R_n$ and with $w_1 \in \mathbb{R}^d$ given, is defined by
> $$w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k). \quad (3.7)$$
>
> ...the stochastic and batch approaches mentioned here have analogues in the simulation and stochastic optimization communities, where they are referred to as stochastic approximation (SA) and sample average approximation (SAA), respectively.
Reference [130] is the classic paper by Robbins and Monro (1951), which introduced stochastic approximation.
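To make (3.7) concrete, here is a hypothetical Python sketch minimizing a finite-sum empirical risk $R_n$ built from least-squares losses $f_i(w) = \frac12 (x_i^\top w - y_i)^2$; the synthetic data, step-size schedule $\alpha_k$, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-sum problem: R_n(w) = (1/n) sum_i f_i(w),
# with f_i(w) = 0.5 * (x_i' w - y_i)^2 on n synthetic data points.
n, d = 1000, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.05 * rng.normal(size=n)

w = np.zeros(d)  # w_1
for k in range(1, 50001):
    i = rng.integers(n)              # sample index i_k uniformly at random
    grad = (X[i] @ w - y[i]) * X[i]  # stochastic gradient ∇f_{i_k}(w_k)
    alpha = 1.0 / (100 + k)          # diminishing step size alpha_k (assumed)
    w = w - alpha * grad             # update (3.7)

print(w)  # iterates approach w_true
```

Note that each step samples a single index $i_k$ from a fixed finite dataset; this is the SAA-flavored setting of (3.7), whereas (1.18) above consumes a fresh observation per step, the streaming SA setting.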
The description of Table 2 in (Toulis et al.) says:

> Modern procedures, such as SGD, are instantiations of the classical Robbins-Monro procedure (Robbins and Monro, 1951).
This Cross Validated answer says that "Stochastic Gradient Descent is preceded by Stochastic Approximation as first described by Robbins and Monro".
It looks like stochastic gradient descent is either an instance of stochastic approximation (SA was proposed first and is the more general framework) or the same thing, since both deal with the same problem (solving $\mathbb{E}_X h(\theta, X) = 0$ for $\theta$, where for SGD $h$ is the gradient of a per-sample loss) using the same approach (iterations that involve a stochastic estimate of $\mathbb{E}_X h(\theta_n, X)$).
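The shared root-finding view can be sketched directly: a Robbins-Monro iteration for $\mathbb{E}_X h(\theta, X) = 0$ that, because $h$ is chosen as the gradient of a per-sample loss, is simultaneously an SGD iteration. The loss $\frac12(\theta - X)^2$ and the schedule $a_n = 1/n$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Robbins-Monro: theta_{n+1} = theta_n - a_n * h(theta_n, X_n),
# targeting the root of E_X[h(theta, X)] = 0.
# Choosing h as the gradient of the per-sample loss 0.5*(theta - X)^2,
# i.e. h(theta, X) = theta - X with X ~ N(mu, 1), makes this SGD;
# the root is theta = E[X] = mu.
mu = 3.0
theta = 0.0
for n in range(1, 100001):
    X = mu + rng.normal()       # fresh observation X_n
    h = theta - X               # noisy estimate of E[h(theta, X)] = theta - mu
    theta -= (1.0 / n) * h      # Robbins-Monro / SGD step with a_n = 1/n

print(theta)  # close to mu
```

With $a_n = 1/n$ and this particular $h$, the recursion reduces algebraically to the running sample mean of the $X_n$, which makes the convergence to $\mu$ easy to see.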
References
- Robbins, Herbert, and Sutton Monro. 1951. "A Stochastic Approximation Method." The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.
- Kushner, Harold, and George G. Yin. 2003. Stochastic Approximation and Recursive Algorithms and Applications. 2nd ed. Vol. 35. Stochastic Modelling and Applied Probability. New York: Springer. https://doi.org/10.1007/b97441.
- Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2018. "Optimization Methods for Large-Scale Machine Learning." arXiv. http://arxiv.org/abs/1606.04838.
- Toulis, Panos, Thibaut Horel, and Edoardo M. Airoldi. 2020. "The Proximal Robbins-Monro Method." arXiv. http://arxiv.org/abs/1510.00967.