Given that you have an estimated AR coefficient of 0.99, I doubt that your series is really stationary. Be that as it may, you can simplify your problem considerably by differencing first and then estimating the differenced model. auto.arima in R's forecast package may even do this for you!
library(forecast)

# Simulate an ARMA(1,4) process with the AR coefficient = 0.99
tmp <- arima.sim(n=50000, list(ar=0.99, ma=c(-0.3, 0.2, 0.1, -0.1)))
auto.arima(tmp, stepwise=FALSE, max.d=1, seasonal=FALSE)
Series: tmp
ARIMA(0,1,4)
Coefficients:
          ma1     ma2     ma3      ma4
      -0.3059  0.1896  0.1025  -0.1027
s.e.   0.0045  0.0046  0.0047   0.0045
sigma^2 = 1.018: log likelihood = -71378.48
Observe how, even with 50,000 observations, auto.arima selects a first difference rather than the true model, and how the AR term simply disappears. (Of course, this won't always happen, but whatever model is selected should be quite close to this one when converted to an MA-only model; the sketch after the equations below checks this via the implied MA weights.) Also note that the MA coefficients are fairly well estimated: no postprocessing is needed. You can do the same by hand: difference each series first, then estimate the model on the differenced data, concatenating the separate series and padding the joins with NA values. The following three equations show why this works, with $\phi(B)$ denoting the MA lag polynomial:
$$\begin{eqnarray}
(1-B)y_t &=& \phi(B)e_t \\
\mathrm{def:}\;\tilde{y}_t &=& (1-B)y_t \\
\tilde{y}_t &=& \phi(B)e_t
\end{eqnarray}$$
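As a concrete sketch of the by-hand route (reusing the simulated tmp from above; arima and ARMAtoMA are base R's stats functions, and ARMAtoMA just iterates the $\psi$-weight recursion, so the unit AR root below is fine):

# difference once, then fit a pure MA(4) to the differenced series
fit_diff <- arima(diff(tmp), order=c(0, 0, 4), include.mean=FALSE)
coef(fit_diff)  # should land close to the auto.arima estimates above
# implied MA(infinity) weights of the fitted model written in levels
# (ar=1 encodes the first difference) vs. those of the true ARMA(1,4)
round(ARMAtoMA(ar=1, ma=coef(fit_diff), lag.max=8), 3)
round(ARMAtoMA(ar=0.99, ma=c(-0.3, 0.2, 0.1, -0.1), lag.max=8), 3)

The two weight sequences should nearly coincide at short lags and drift apart only slowly, because 0.99 is so close to 1.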
In your case, you already have a preliminary estimate of the model, so you have a good idea of how much padding is required: for a pure MA(q) model, observations separated by more than q lags are independent, so q or more NAs between segments is enough (the example below uses 8 to be safe).
Here's an example:
# Simulate 1000 series of varying lengths, difference each one separately,
# and concatenate them with 8 NAs between segments so the MA terms
# cannot reach across series boundaries
series_length <- sample(10:500, size=1000, replace=TRUE)
x <- diff(arima.sim(n=series_length[1], list(ar=0.99, ma=c(-0.3, 0.2, 0.1, -0.1))))
for (i in 2:1000) {
  x <- c(x, rep(NA, 8), diff(arima.sim(n=series_length[i], list(ar=0.99, ma=c(-0.3, 0.2, 0.1, -0.1)))))
}
length(x)
[1] 263407
auto.arima(x, stepwise=FALSE, d=0, max.p=0, seasonal=FALSE)
Series: x
ARIMA(0,0,5) with non-zero mean
Coefficients:
          ma1     ma2     ma3      ma4      ma5    mean
      -0.3023  0.1896  0.0935  -0.1055  -0.0072  0.0010
s.e.   0.0020  0.0021  0.0021   0.0021   0.0020  0.0017
sigma^2 = 1.004: log likelihood = -362990.7
We didn't get the model structure quite right, but the parameter estimates are pretty close and the extra MA term is very small.
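In fact, the structure is less wrong than it looks. Differencing each series turns the true process into an ARMA(1,5), $(1-0.99B)\tilde{y}_t = (1-B)\phi(B)e_t$, whose implied MA weights beyond lag 5 are tiny, so an MA(5) is an excellent approximation. A quick check (the ma vector below is the hand-expanded product $(1-B)(1-0.3B+0.2B^2+0.1B^3-0.1B^4)$):

# MA polynomial of the differenced true process, (1-B)*phi(B), expanded by hand
ma_diff <- c(-1.3, 0.5, -0.1, -0.2, 0.1)
# implied MA(infinity) weights of the ARMA(1,5) that differencing produces
round(ARMAtoMA(ar=0.99, ma=ma_diff, lag.max=5), 3)
# approximately -0.310  0.193  0.091 -0.110 -0.009
# vs. the fitted ma1..ma5: -0.3023 0.1896 0.0935 -0.1055 -0.0072

These line up with the estimates above almost term for term.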
Even if the model structure appears quite different from the MA(4) you'd like to see, it's not likely to matter much for the actual forecasts. I cannot do better than to quote Richard Hardy's answer to "Lag selection and model instability for ARIMA-GARCH in rolling windows for forecasting":
> The fact that the selected models change frequently from one window to the next may be due not only to frequent structural changes (which is probably unlikely) but to the fact that there are several models that approximate the patterns in the data about equally well, so their AICs are very close. Then changing two data points out of 1000 (dropping the oldest point and adding one new point) can make auto.arima switch between these competing models. I would not worry too much about that, as each of these models likely implies very similar time series patterns. They are probably almost equivalent representations of the same thing.
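If you want to see the forecast point on the simulated data, compare the two specifications on the tmp series from the first example; a minimal sketch (Arima and forecast are from the forecast package):

fit_d <- Arima(tmp, order=c(0,1,4))  # the differenced, MA-only specification
fit_a <- Arima(tmp, order=c(1,0,4))  # the true ARMA(1,4) structure
# point forecasts from the two specifications, side by side
cbind(forecast(fit_d, h=10)$mean, forecast(fit_a, h=10)$mean)

At short horizons the two sets of point forecasts should be nearly identical; they separate only slowly at longer horizons, again because 0.99 is so close to 1.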