First of all, I agree 100% with Stephen's answer; I'll just add a little bit from my 2 years of experience!
The ML vs traditional methods IMO boils down to a simple question:
Do you have good drivers to use as variables?
Time series methods work best for time series; of course you can use other factors to aid, but with one time series going to one model you also need to be careful with those features. ML (boosted trees / RFs like you suggest) works best for tabular data, where you tend to lose your time series structure, so you have to make up for that with good tabular features and simply 'represent' time with other features.
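To make that concrete, here is a minimal sketch of 'representing' time as tabular features for a tree model. The column names, the weekly frequency, and the random demand data are all just illustrative assumptions:

```python
import numpy as np
import pandas as pd

# Hypothetical weekly demand for one SKU
rng = pd.date_range("2022-01-03", periods=104, freq="W-MON")
df = pd.DataFrame({
    "series_id": "sku_1",
    "date": rng,
    "y": np.random.default_rng(0).poisson(20, size=len(rng)).astype(float),
})

df = df.sort_values(["series_id", "date"])
grp = df.groupby("series_id")["y"]

# Lags and rolling stats carry the time-series structure into flat columns
df["lag_1"] = grp.shift(1)                      # last week's demand
df["lag_52"] = grp.shift(52)                    # same week last year
df["rolling_mean_4"] = grp.transform(lambda s: s.shift(1).rolling(4).mean())

# Calendar features let the tree learn seasonality without a date index
df["week_of_year"] = df["date"].dt.isocalendar().week.astype(int)
df["month"] = df["date"].dt.month
```

Every feature is shifted so it only uses information available at forecast time; leaking the current period's value is the classic mistake here.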
Things like price of products, marketing expense, etc. If you don't have these types of variables for your domain, then I would bet a decent stat engine outperforms a state-of-the-art ML model in a production setting. That production-setting piece is important: with an ML model you have very little control over the actual forecast - you get what you get. A stat engine should allow you to switch on the fly to another method if the current one's forecast is wonky, which leads to my next thought. Just remember, though: if you use something like GDP, you then probably have to forecast GDP itself to use it in the future, which is probably very problematic! Or use lagged GDP, which may not be as useful.
- What makes a decent stat engine?
Your model portfolio (what you are looking into now) is important, but model selection and a business logic layer are everything. For model selection, look to time series cross-validation. For the business logic layer, I would lean on the stakeholders of the forecast. For example, you probably want to assign a 'demand type' to each given time series. If 30% or more of the series is 0, assign it a 'type' that only allows certain models to be selected, such as simple exponential smoothing, Croston, or mean - an ARIMA may produce wonky results in those settings. You could also check that the forecast doesn't go from 5 units to 50 million, something that is possible in an overparameterized ARIMA. You could check whether certain product lifecycles are at play - like a build-up and fall-off over the years - and then fit a more local model, or weight the more recent years more if your method takes sample weights. A lot of possibilities here for adding logic that aids the engine.
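A tiny sketch of what that logic layer can look like. The 30% zero cutoff matches the rule of thumb above; the function names, the 10x blow-up ratio, and the returned labels are my own illustrative choices:

```python
import numpy as np

def classify_demand(y, zero_share_cutoff=0.3):
    """Assign a rough 'demand type' that gates which models may be selected."""
    y = np.asarray(y, dtype=float)
    if (y == 0).mean() >= zero_share_cutoff:
        return "intermittent"   # restrict to SES / Croston / mean
    return "smooth"             # full portfolio allowed

def forecast_is_sane(history, forecast, max_ratio=10.0):
    """Reject forecasts that explode relative to history (e.g. 5 units -> 50M)."""
    history = np.asarray(history, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    hist_max = max(np.max(np.abs(history)), 1e-9)
    return np.max(np.abs(forecast)) <= max_ratio * hist_max

print(classify_demand([0, 0, 5, 0, 3, 0]))        # intermittent
print(forecast_is_sane([5, 6, 4], [50_000_000]))  # False
```

The point isn't these exact thresholds; it's that the engine refuses to emit a forecast the business would immediately laugh at.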
In summary,
Add some naive methods. You could add other methods too, but I personally would stay away from Prophet - AutoARIMA + AutoETS + naive methods (mean, last period, last seasonal period) will be a good start. Take a look at your model selection criteria to ensure it is robust, and add some 'logic' to help ensure the chosen model is appropriate and isn't merely the one that minimizes some loss function.
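Those naive baselines are a few lines each; a minimal sketch (function name and signature are just for illustration):

```python
import numpy as np

def naive_forecasts(y, horizon, season_length=12):
    """Three cheap baselines any engine should keep in its portfolio."""
    y = np.asarray(y, dtype=float)
    return {
        "mean": np.full(horizon, y.mean()),
        "last_period": np.full(horizon, y[-1]),
        # tile the last full season across the horizon
        "last_seasonal_period": np.resize(y[-season_length:], horizon),
    }

fc = naive_forecasts(np.arange(1, 25, dtype=float), horizon=6, season_length=12)
```

If a fancy model can't beat these in cross-validation, the logic layer should fall back to them.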
But most importantly -
Look at your forecasts.
Set up some quick flags to surface forecasts where the model suggests new maxes/mins, or where the average of the forecast period is significantly different from the average of the history. Figure out whether there are commonalities between the flagged series, like a ton of zeros. Many times it is just an odd bug in your code - your outlier detection isn't working right, or some other issue is causing bad results.
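These flags are cheap to implement. A sketch with made-up flag names and a hypothetical 50% level-shift tolerance:

```python
import numpy as np

def flag_forecast(history, forecast, level_shift_tol=0.5):
    """Return a list of review flags for a (history, forecast) pair."""
    history = np.asarray(history, dtype=float)
    forecast = np.asarray(forecast, dtype=float)
    flags = []
    if forecast.max() > history.max():
        flags.append("new_max")
    if forecast.min() < history.min():
        flags.append("new_min")
    hist_mean = history.mean()
    if hist_mean != 0 and abs(forecast.mean() - hist_mean) / abs(hist_mean) > level_shift_tol:
        flags.append("level_shift")
    return flags

print(flag_forecast([10, 12, 11, 13], [30, 31, 32]))  # ['new_max', 'level_shift']
```

Run this over every series after each forecast cycle and eyeball whatever gets flagged.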
If you have done all of that and want additional models to try my main recommendations would be:
- Theta - there are tons of implementations across Python and R. Theta plus AutoARIMA do well in general.
- Croston - pretty standard for intermittent data.
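For reference, classic Croston is simple enough to sketch from scratch - it smooths the non-zero demand sizes and the gaps between them separately, then forecasts their ratio as a flat line. This is an illustrative implementation, not any particular library's; in practice you'd reach for an existing one:

```python
import numpy as np

def croston(y, horizon, alpha=0.1):
    """Croston's method for intermittent demand (illustrative version)."""
    y = np.asarray(y, dtype=float)
    demand = None      # smoothed non-zero demand size
    interval = None    # smoothed gap between non-zero demands
    periods_since = 0
    for value in y:
        periods_since += 1
        if value > 0:
            if demand is None:  # initialize on the first non-zero demand
                demand, interval = value, periods_since
            else:               # simple exponential smoothing on both series
                demand += alpha * (value - demand)
                interval += alpha * (periods_since - interval)
            periods_since = 0
    if demand is None:          # all-zero history
        return np.zeros(horizon)
    return np.full(horizon, demand / interval)

print(croston([0, 0, 6, 0, 0, 6, 0, 0, 6], horizon=3))  # ~2.0 per period
```

Note the flat forecast is a demand *rate*, not a prediction of which period the demand lands in - that's the standard caveat with Croston.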
A lot of 'AutoML' time series methods literally try everything under the sun, take a lot of time, and don't add much value beyond the methods listed above.
Additionally, you could try out some of my personal projects in the field:
- ThymeBoost, which is gradient-boosted time series decomposition with traditional methods like ETS and ARIMA
- TimeMurmur, my newest, which does large-scale LightGBM time series forecasting - I probably wouldn't use it in prod, but you could give it a shot as a baseline.