There are already multiple questions asking where the lagged error terms in an MA(q) model come from and how the model can be fit without knowing the errors in advance. Textbooks offer several methods for dealing with this, but I've noticed that older texts, and even recent formal ones, seem to recommend constructing the likelihood for the entire MA(q) or ARMA(p, q) model by fixing initial values for the pre-sample errors $\varepsilon_0$ through $\varepsilon_{-q+1}$. We then iteratively compute
$$\varepsilon_t = y_t - \mu - \theta_1 \varepsilon_{t-1} - \theta_2 \varepsilon_{t-2} - \cdots - \theta_q \varepsilon_{t-q}$$
Once the initial errors are set (often to 0), everything in this recursion is known except the parameters, which can then be estimated by numerical optimization. For an MA(1), the recursion expands into a tedious but clean equation that anyone could derive with pencil and paper:
$$\varepsilon_t = (y_t - \mu) + (-1)^1 \theta_1^1 (y_{t-1} - \mu) + (-1)^2 \theta_1^2 (y_{t-2} - \mu) + \cdots + (-1)^{t-1} \theta_1^{t-1} (y_1 - \mu) + (-1)^t \theta_1^t c$$
where $c$ is whatever value we picked for $\varepsilon_0$.
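To convince myself the two forms agree, here is a quick numerical check (my own sketch, not something from Hamilton) that the recursion and the closed-form MA(1) expansion produce the same $\varepsilon_t$:

```python
import numpy as np

# Any series and parameter values will do for checking the algebraic identity.
rng = np.random.default_rng(0)
y = rng.normal(size=50)
mu, theta, c = 0.3, 0.6, 0.2      # c is the value chosen for eps_0

# Recursive form: eps_t = (y_t - mu) - theta * eps_{t-1}
eps = np.empty(len(y))
prev = c
for t in range(len(y)):
    eps[t] = (y[t] - mu) - theta * prev
    prev = eps[t]

# Closed form: eps_T = sum_{j=0}^{T-1} (-theta)^j (y_{T-j} - mu) + (-theta)^T c
T = len(y)
closed = sum((-theta) ** j * (y[T - 1 - j] - mu) for j in range(T)) + (-theta) ** T * c
assert np.isclose(eps[-1], closed)
```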
This is found, for example, on page 130 of Hamilton's *Time Series Analysis* (1994 edition). He even includes something like the expansion I wrote above for the $q = 1$ case. However, when I tried to derive a similarly clean equation for $q > 1$, it became clear that, while the recursion could easily be programmed, it is monstrous to write out.
This approach is also often recommended in texts for fitting ARMA(p, q) models: set initial errors, estimate the remaining errors with a variation on the recursion above, and optimize the resulting huge expression with respect to the parameters.
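To make that concrete, here is a minimal sketch of the conditional-sum-of-squares version of this idea, assuming pre-sample errors fixed at zero; the `css` function and the use of `scipy.optimize.minimize` are my own illustration, not any text's reference implementation:

```python
import numpy as np
from scipy.optimize import minimize

def css(params, y, p, q):
    """Recursively build the errors with pre-sample values fixed at zero,
    then return the conditional sum of squares."""
    mu, phi, theta = params[0], params[1:1 + p], params[1 + p:]
    eps = np.zeros(len(y))
    for t in range(len(y)):
        ar = sum(phi[i] * (y[t - 1 - i] - mu) for i in range(p) if t - 1 - i >= 0)
        ma = sum(theta[j] * eps[t - 1 - j] for j in range(q) if t - 1 - j >= 0)
        eps[t] = (y[t] - mu) - ar - ma
    return np.sum(eps ** 2)

# Simulate a small ARMA(1, 1) series and fit it by minimizing the CSS.
rng = np.random.default_rng(1)
e = rng.normal(size=500)
y = np.zeros(500)
for t in range(1, 500):
    y[t] = 0.5 * y[t - 1] + e[t] + 0.3 * e[t - 1]

res = minimize(css, x0=np.zeros(3), args=(y, 1, 1), method="Nelder-Mead")
print(res.x)  # roughly (mu, phi_1, theta_1) = (0, 0.5, 0.3)
```

Because the errors are rebuilt inside the objective at every evaluation, the optimizer never needs them in advance, which is exactly why the expression balloons when written out symbolically.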
This contrasts sharply with most blogs, such as this one, which seem to describe something like the Hannan-Rissanen (1982) algorithm. That approach fits an AR model first, then uses that model's residuals to fit the MA portion; the final ARMA model is just the two pieces put together. The Hannan-Rissanen algorithm in particular fits an ARMA model this way and then repeats the fitting with updated errors until convergence.
The sequential method seems much, much simpler than the simultaneous method I described above. Estimating the parameters of an AR model is simple, and once the lagged error terms are treated as known, fitting the MA model becomes simple as well.
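For comparison, here is a rough sketch of the two-stage idea, with both stages done by ordinary least squares. This is my paraphrase of the procedure described above, not the exact algorithm from the 1982 paper; in particular, the default long-AR order `m` is an arbitrary choice rather than their order-selection rule:

```python
import numpy as np

def hannan_rissanen(y, p, q, m=None):
    """Two-stage sketch: a long AR(m) supplies proxy errors, then OLS of
    y_t on p lags of y and q lags of those proxies gives the ARMA fit."""
    if m is None:
        m = max(p, q) + 5           # order of the long autoregression
    # Stage 1: long AR(m) by OLS; residuals stand in for the unobserved errors.
    X = np.column_stack([y[m - 1 - i:len(y) - 1 - i] for i in range(m)])
    X = np.column_stack([np.ones(len(X)), X])
    beta, *_ = np.linalg.lstsq(X, y[m:], rcond=None)
    eps_hat = y[m:] - X @ beta
    # Stage 2: regress y_t on its own lags and the lagged proxy errors.
    yy = y[m:]                      # aligned with eps_hat
    k = max(p, q)
    Z = np.column_stack(
        [yy[k - 1 - i:len(yy) - 1 - i] for i in range(p)]
        + [eps_hat[k - 1 - j:len(eps_hat) - 1 - j] for j in range(q)]
    )
    Z = np.column_stack([np.ones(len(Z)), Z])
    coef, *_ = np.linalg.lstsq(Z, yy[k:], rcond=None)
    return coef[0], coef[1:1 + p], coef[1 + p:]  # intercept, AR, MA estimates
```

Repeating stage 2 with errors recomputed from the fitted ARMA gives the iterate-until-convergence refinement mentioned above. Both stages are plain linear regressions, which is what makes the method so cheap.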
Many of these texts were published in the '50s and '60s, when computational power was hard to come by, yet they recommend the more difficult method. And even after Hannan and Rissanen published their paper, many texts and lectures seem to have all but ignored it and continued referencing Box and Jenkins or Hamilton.
Is there any practical reason one would ever use the simultaneous optimization method instead of some variant of the Hannan-Rissanen algorithm? What are the relative merits of these two fitting methods?