I am studying ARMA processes. At the end of the course the professor told us that estimating the next sample of an ARMA process from its past of length $p$ (i.e., projecting $X_t$ onto $\text{span}(X_{t-1}, \dots, X_{t-p})$) is the same as finding the least-squares solution to $X_t = a_1 X_{t-1} + a_2 X_{t-2} + \dots + a_p X_{t-p}$. This can be done simply by collecting the "samples" with a moving window of length $p$ (the prediction length) over the whole signal and then solving the resulting linear regression problem by any standard method.
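To make the sliding-window idea concrete, here is a minimal sketch of what I mean (my own illustration, not from the course); the AR(2) coefficients 0.6 and -0.3 and the simulation length are made up, and the fit is done by plain least squares in NumPy:

```python
import numpy as np

# Simulate a toy AR(2) process: X_t = 0.6 X_{t-1} - 0.3 X_{t-2} + noise
rng = np.random.default_rng(0)
n = 5000
x = np.zeros(n)
for t in range(2, n):
    x[t] = 0.6 * x[t - 1] - 0.3 * x[t - 2] + rng.standard_normal()

# Slide a window of length p over the signal: the row for time t
# holds the regressors (X_{t-1}, ..., X_{t-p}); the target is X_t.
p = 2
X = np.column_stack([x[p - 1 - k : n - 1 - k] for k in range(p)])
y = x[p:]

# Ordinary least squares recovers coefficients close to (0.6, -0.3)
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

So the whole "prediction" problem reduces to building a lagged design matrix and calling a least-squares solver, which is exactly what prompts my question below.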
My question then is: why do we need the complicated theory of ARMA processes, and solutions involving covariance matrices and so on, which are much harder to compute, when we can just perform a simple linear regression? What's more, a linear regression can easily be extended to a polynomial (or kernel) regression, essentially for free, yielding a model more powerful than a plain ARMA process.
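A sketch of the polynomial extension I have in mind (all numbers here are hypothetical, chosen only for illustration): simulate a nonlinear autoregression, then recover it by simply augmenting the lagged design matrix with a squared term; least squares still does all the work:

```python
import numpy as np

# Simulate a toy nonlinear AR(1): the quadratic term makes a purely
# linear model misspecified.
rng = np.random.default_rng(1)
n = 20000
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.8 * x[t - 1] - 0.1 * x[t - 1] ** 2 + 0.1 * rng.standard_normal()

# Polynomial "features" of the single lag: [X_{t-1}, X_{t-1}^2].
X = np.column_stack([x[:-1], x[:-1] ** 2])
y = x[1:]

# The same least-squares machinery now fits a nonlinear model;
# the estimates land near the true values (0.8, -0.1).
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)
```

Nothing about the fitting procedure changed; only the feature columns did, which is why the extension feels "free" to me.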
It feels like, from a practical perspective, an ARMA model is just a complicated way of saying "linear regression on previous samples". Am I missing something? There must surely be an explanation for why we do all of this computation.