I recently started an internship in business intelligence. When my boss learned that I have a background in machine learning and deep learning, he asked me to build a model that predicts a specific number for the next month. So it is a time series problem, and the dataset I have is very small: it starts in May 2019, so it has just 31 rows. When I plotted the data, it had no clear trend. This picture of a graph looks like my dataset
(sorry, I cannot share the dataset because of privacy). I started by differencing the data to remove the seasonality and applying a log transformation. After that, I built models using ARIMA, LSTM, and Prophet. I also computed a prediction interval for the predicted number, expecting the actual number to fall inside it. Unfortunately, the actual number (for this month) was outside the interval. So I went back to the database and found a feature that I think may help, since it is highly correlated with the main variable; this makes it a multivariate time series problem. I tried the VAR algorithm, but that model also failed: the actual number for each feature was outside the interval. This is the first time I have built a time series model in industry on a real dataset, and I am working alone. Is there an approach I have missed that could help me build a better model? Or should I tell my boss that I cannot build a model for this dataset, especially since the data is affected by the coronavirus?
-
Hi @user354187. With no rudeness intended, you seem a little lost. First, can you specify which type of data you have, continuous or discrete? This will affect the type of model you need for prediction. Second, regarding the models you're currently using: ARIMA models are for non-stationary problems. Did you check whether your data is stationary? Third, regarding adding a highly correlated variable to your multivariate model: this is a bad idea for many reasons. Finally, you mentioned a structural break due to COVID. You should run a structural break test. – EB3112 Mar 29 '22 at 12:44
-
Hi @EB3112, thank you for responding to my question. Yes, my data is continuous, but I do not know how this affects the choice of model. As for stationarity, yes, I checked using the adfuller test and found a p-value greater than 0.05. As for the other variable I added, which is highly correlated with the main variable: I know that in machine learning, when two features are correlated, we need to remove one. But I thought that adding another feature would turn the problem into a multivariate time series, and that an algorithm like VAR might help. – user354187 Mar 31 '22 at 07:19
-
Hi @user354187. I just thought that if your data was non-continuous, you might need something like discrete event simulation rather than models designed with continuous data in mind. But now that you've clarified it is continuous and stationary, ARIMA is not for you. Again, if you go down the route of a VAR, I would check for structural breaks in your data. Moreover, one of the pitfalls of unrestricted VARs is that they're endogenous and not uniquely identified. They're therefore often not prediction models; instead, they're often used for impulse response type evaluation. – EB3112 Mar 31 '22 at 08:17
-
@EB3112 If ARIMA is not good for me, what should I use, and why is it not good? I do not know if you saw my comment to Tim, but I tried to convert my data to daily and then sum the days to get the number for a month, and I also got bad results. – user354187 Apr 03 '22 at 09:04
-
Hi @user354187. You're trying to calculate the number of events (of a particular unnamed event), am I right? – EB3112 Apr 04 '22 at 08:45
-
@EB3112 Kind of. You could say I want to predict next month's events based on past events. – user354187 Apr 04 '22 at 10:08
-
Hi @user354187. Sorry I've not been particularly active for a while. However, I noticed from your earlier comment that the p-value on the ADF test actually indicates you have a unit-root non-stationary process. – EB3112 Apr 07 '22 at 12:33
-
@EB3112 Yes, you are correct, my data is non-stationary. Also, when I plot the PACF, all the points fall in the blue region (noise). So, as I said to Tim, I converted my data to daily data, and it looks better in the PACF. But when I fit ARIMA and Prophet models, predicted the next 30 days, summed the predictions, and compared them with the actual data, the predictions were unfortunately terrible (I know it is a cross-validation problem, but I do not know how to solve it for time series). And I do not know whether my approach is correct. – user354187 Apr 11 '22 at 11:36
-
Hi @user354187. Just to confirm: since you have now identified that your data is non-stationary, has it been differenced? Otherwise, cross-validation will be bogus, because your data is not IID. Moreover, can you please identify whether or not your data is actually suitable for differencing? If it is discrete event data, as I suspect, then it would not be a good candidate for differencing. – EB3112 Apr 11 '22 at 11:47
-
@EB3112 Yes, I differenced the monthly data, but it does not work well. The daily data is stationary (sorry if I did not make this clear from the beginning). Can you tell me how I can know whether my data is actually suitable for differencing? – user354187 Apr 11 '22 at 11:56
-
If it is event data, then you would be taking the difference of an event. This would make very little sense to me. Typically you'd be looking for a continuous variable at the very least. But I'll admit, it's not a problem I've had to wrestle with myself, so I am not an expert. – EB3112 Apr 11 '22 at 12:03
-
No problem @EB3112, and thank you for the time you have given me, I really appreciate it. I have already informed my boss that this dataset is difficult to forecast. I thought maybe I was missing something, but I don't think so. – user354187 Apr 11 '22 at 12:09
1 Answer
First, you seem to have just a few data points, so you need a very simple model.
A good rule of thumb for answering questions like this is to ask yourself whether there are any likely patterns in the data. The general plot doesn't reveal anything like this; what about if you difference it? What do the plots of the STL-decomposed data look like? My guess would be that there are no regularities, so there is not much to build a model on. In such a case, you can treat the data as white noise around the global mean.
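As a rough illustration of what "white noise around the global mean" could mean in practice, here is a minimal sketch using only the Python standard library. The series `monthly` is randomly generated stand-in data (the real dataset is not available), and the helper names `mean_forecast` and `autocorrelation` are made up for this example. The interval uses a normal approximation, which is probably slightly too narrow with only 31 points.

```python
import math
import random
import statistics


def mean_forecast(series, level=0.95):
    """Forecast the next value as the historical mean.

    Returns (point, lower, upper) using a normal-approximation
    prediction interval for i.i.d. noise around a constant mean.
    """
    n = len(series)
    mean = statistics.fmean(series)
    sd = statistics.stdev(series)  # sample standard deviation
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    # sqrt(1 + 1/n): noise in the new point plus uncertainty in the mean
    half_width = z * sd * math.sqrt(1 + 1 / n)
    return mean, mean - half_width, mean + half_width


def autocorrelation(series, lag):
    """Sample autocorrelation of the series at the given lag."""
    mean = statistics.fmean(series)
    denom = sum((x - mean) ** 2 for x in series)
    num = sum(
        (series[t] - mean) * (series[t - lag] - mean)
        for t in range(lag, len(series))
    )
    return num / denom


# Hypothetical stand-in for the 31 monthly values (NOT the real data).
random.seed(0)
monthly = [random.gauss(1000, 150) for _ in range(31)]

point, lower, upper = mean_forecast(monthly)
print(f"forecast: {point:.1f}, 95% interval: [{lower:.1f}, {upper:.1f}]")

# Rough white-noise check: autocorrelations inside +/- 1.96/sqrt(n)
# are what the shaded band in a typical ACF plot represents.
bound = 1.96 / math.sqrt(len(monthly))
acf = [autocorrelation(monthly, k) for k in range(1, 7)]
for k, r in enumerate(acf, start=1):
    flag = "inside" if abs(r) <= bound else "outside"
    print(f"lag {k}: {r:+.3f} ({flag} the band of {bound:.3f})")
```

If the autocorrelations of the real series mostly stay inside the band, a mean forecast with a wide interval may be about as good as anything more elaborate.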
See also the "How to know that your machine learning problem is hopeless?" thread.
Tim
-
Hi @Tim, thank you for your response. Your guess is correct: there are no regularities when I plot the STL decomposition, and in the autocorrelation plot all the points look like white noise. Thanks also for the resources you put in your answer; they are very helpful. – user354187 Mar 31 '22 at 07:18
-
@user354187 if it solves your question remember to upvote or accept the answer so that others know it is solved. – Tim Mar 31 '22 at 07:31
-
Of course @Tim, but I have a question. I tried converting the data from months to days, and it seems to work well in training. But when I tried to predict January, February, and March (which I had removed from the training set), the results for these three months were very bad. Is this for the same reason, that is, converting to days does not help much because the data has no clear patterns to follow? – user354187 Mar 31 '22 at 11:08
-
@user354187 Sounds like a textbook example of overfitting. It is hard to comment on why without access to your data, code, etc. If you are using something like a neural network, as you mentioned, it is likely to overfit. – Tim Mar 31 '22 at 11:29
-
Yes, @Tim, I know it is an overfitting problem. I tried to apply cross-validation with the ARIMA model, but I got bad results for the mean squared error (MSE = 182857). I think it is impossible to get a good model for my dataset. – user354187 Apr 03 '22 at 08:57
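For readers wondering what cross-validation for a time series might look like, one common option is rolling-origin (expanding-window) evaluation: fit on everything up to time t, predict t, then move forward by one step. The sketch below is only an assumption about how this could be set up, using the standard library and two trivial baselines. The function names and the `min_train` parameter are hypothetical, and the data is randomly generated, not the dataset from the question.

```python
import random
import statistics


def rolling_origin_mse(series, forecast_fn, min_train=12):
    """One-step-ahead MSE with an expanding training window.

    For each t >= min_train, forecast series[t] from series[:t] only,
    so no future values leak into the training data.
    """
    if len(series) <= min_train:
        raise ValueError("series must be longer than min_train")
    errors = []
    for t in range(min_train, len(series)):
        prediction = forecast_fn(series[:t])
        errors.append((series[t] - prediction) ** 2)
    return statistics.fmean(errors)


def mean_baseline(train):
    """Predict the mean of everything seen so far."""
    return statistics.fmean(train)


def naive_baseline(train):
    """Predict the most recent observed value."""
    return train[-1]


# Hypothetical stand-in for the 31 monthly values (NOT the real data).
random.seed(1)
monthly = [random.gauss(1000, 150) for _ in range(31)]

print("mean baseline MSE: ", round(rolling_origin_mse(monthly, mean_baseline)))
print("naive baseline MSE:", round(rolling_origin_mse(monthly, naive_baseline)))
```

An MSE is only meaningful relative to the scale of the series and to such baselines. An MSE of 182857 corresponds to an RMSE of about 428 in the units of the data, which could be large or small depending on the typical monthly value. If ARIMA, Prophet, or an LSTM cannot beat the mean baseline under this kind of evaluation, that supports the conclusion that the series has little predictable structure.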