According to (Kushner & Yin, sec. 1.1.3), stochastic approximation and stochastic gradient descent are the same thing (emphasis mine):
> ...stochastic approximation form of linearized least squares:
> $$\theta_{n+1} = \theta_n + \epsilon_n \phi_n [y_n - \phi_n' \theta_n] \quad (1.18)$$
>
> ...The mean ODE, which characterizes the asymptotic behavior of (1.18), is
> $$\dot{\theta} = -\frac12 \frac{\partial}{\partial \theta} \mathbb{E}[y_n - \phi_n' \theta]^2, \quad (1.19)$$
> which is asymptotically stable about the optimal point $\overline{\theta}$.
>
> The key to the value of the stochastic approximation algorithm is the representation of the right side of (1.19) as the negative gradient of the cost function. This emphasizes that, whatever the origin of the stochastic approximation algorithm, it can be interpreted as a "stochastic" gradient descent algorithm. For example, (1.18) can be interpreted as a "noisy" gradient procedure.
>
> ...the driving observation in (1.18) is just a "noise-corrupted" value of the desired gradient at $\theta=\theta_n$.
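The recursion (1.18) is easy to illustrate. Below is a minimal Python sketch on synthetic data; the regressor distribution, noise level, and step-size schedule $\epsilon_n = 1/(n+10)$ are my own assumptions, chosen to satisfy the usual diminishing-step-size conditions:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical linear-regression setup: y_n = phi_n' theta_true + noise.
theta_true = np.array([2.0, -1.0])
theta = np.zeros(2)  # theta_1

for n in range(1, 20001):
    phi = rng.normal(size=2)                       # regressor phi_n
    y = phi @ theta_true + 0.1 * rng.normal()      # noisy observation y_n
    eps = 1.0 / (n + 10)                           # step size epsilon_n (assumed schedule)
    theta = theta + eps * phi * (y - phi @ theta)  # recursion (1.18)

print(theta)  # iterates settle near theta_true
```

With a diminishing step size, the iterates settle near the optimal point $\overline{\theta}$, consistent with the asymptotic stability of the mean ODE (1.19).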
According to (Bottou et al., sec. 3.2):
> The prototypical stochastic optimization method is the stochastic gradient method (SG) [130], which, in the context of minimizing $R_n$ and with $w_1 \in \mathbb{R}^d$ given, is defined by
> $$w_{k+1} = w_k - \alpha_k \nabla f_{i_k}(w_k). \quad (3.7)$$
>
> ...the stochastic and batch approaches mentioned here have analogues in the simulation and stochastic optimization communities, where they are referred to as stochastic approximation (SA) and sample average approximation (SAA), respectively.
Reference [130] is the classic paper by Robbins and Monro (1951), which introduced stochastic approximation.
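To make (3.7) concrete, here is a hypothetical Python sketch minimizing a finite-sum empirical risk $R_n$ built from least-squares losses $f_i(w) = \frac12 (x_i^\top w - y_i)^2$; the synthetic data, step-size schedule $\alpha_k$, and iteration count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical finite-sum problem: R_n(w) = (1/n) sum_i f_i(w),
# with f_i(w) = 0.5 * (x_i' w - y_i)^2 on n synthetic data points.
n, d = 1000, 3
w_true = np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, d))
y = X @ w_true + 0.05 * rng.normal(size=n)

w = np.zeros(d)  # w_1
for k in range(1, 50001):
    i = rng.integers(n)              # sample index i_k uniformly at random
    grad = (X[i] @ w - y[i]) * X[i]  # stochastic gradient ∇f_{i_k}(w_k)
    alpha = 1.0 / (100 + k)          # diminishing step size alpha_k (assumed)
    w = w - alpha * grad             # update (3.7)

print(w)  # iterates approach w_true
```

Note that each step samples a single index $i_k$ from a fixed finite dataset; this is the SAA-flavored setting of (3.7), whereas (1.18) above consumes a fresh observation per step, the streaming SA setting.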
The description of Table 2 in (Toulis et al.) says:

> Modern procedures, such as SGD, are instantiations of the classical Robbins-Monro procedure (Robbins and Monro, 1951).
This Cross Validated answer says that "Stochastic Gradient Descent is preceded by Stochastic Approximation as first described by Robbins and Monro".
It looks like stochastic gradient descent is either an instance of stochastic approximation (SA was proposed first and is the more general framework) or the same thing, since both deal with the same problem (solving $\mathbb{E}_X h(\theta, X) = 0$ for $\theta$, where for SGD $h$ is the gradient of a per-sample loss) using the same approach (iterations that involve a stochastic estimate of $\mathbb{E}_X h(\theta_n, X)$).
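The shared root-finding view can be sketched directly: a Robbins-Monro iteration for $\mathbb{E}_X h(\theta, X) = 0$ that, because $h$ is chosen as the gradient of a per-sample loss, is simultaneously an SGD iteration. The loss $\frac12(\theta - X)^2$ and the schedule $a_n = 1/n$ are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(2)

# Robbins-Monro: theta_{n+1} = theta_n - a_n * h(theta_n, X_n),
# targeting the root of E_X[h(theta, X)] = 0.
# Choosing h as the gradient of the per-sample loss 0.5*(theta - X)^2,
# i.e. h(theta, X) = theta - X with X ~ N(mu, 1), makes this SGD;
# the root is theta = E[X] = mu.
mu = 3.0
theta = 0.0
for n in range(1, 100001):
    X = mu + rng.normal()       # fresh observation X_n
    h = theta - X               # noisy estimate of E[h(theta, X)] = theta - mu
    theta -= (1.0 / n) * h      # Robbins-Monro / SGD step with a_n = 1/n

print(theta)  # close to mu
```

With $a_n = 1/n$ and this particular $h$, the recursion reduces algebraically to the running sample mean of the $X_n$, which makes the convergence to $\mu$ easy to see.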
References
- Robbins, Herbert, and Sutton Monro. 1951. "A Stochastic Approximation Method." The Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.
- Kushner, Harold, and George G. Yin. 2003. Stochastic Approximation and Recursive Algorithms and Applications. 2nd ed. Vol. 35. Stochastic Modelling and Applied Probability. New York: Springer. https://doi.org/10.1007/b97441.
- Bottou, Léon, Frank E. Curtis, and Jorge Nocedal. 2018. "Optimization Methods for Large-Scale Machine Learning." arXiv. http://arxiv.org/abs/1606.04838.
- Toulis, Panos, Thibaut Horel, and Edoardo M. Airoldi. 2020. "The Proximal Robbins-Monro Method." arXiv. http://arxiv.org/abs/1510.00967.