
I have an imbalanced time series dataset for a time series forecasting (regression) problem: forecast one video of 24-hour data (144 images of 7x7) given one video of 24-hour data (144 images of 7x7). As a test, I filtered the training set so that only sequences whose mean target value is less than or equal to the mean of the validation targets are kept. From the very first epoch this made the loss (Gradient Difference Loss + MAE) and the metrics (RMSE and MAE) of the training, validation and test sets much better; it was a very noticeable change. My understanding is that this makes the training data more balanced, since its distribution becomes more similar to the validation data and consequently to the test data. From what I have checked, this is a sampling technique, I believe downsampling. I would like to know whether this approach is valid, and what other solutions there are. The Python code I use to downsample:

import numpy as np

# Keep only the training sequences whose mean target value is at or below
# the mean of the validation targets.
mean_Y_val = np.mean(Y_val)
mask_Y_train = np.mean(Y_train, axis=(1, 2, 3)) <= mean_Y_val
Y_train = Y_train[mask_Y_train]
X_train = X_train[mask_Y_train]
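
For reference, here is a minimal, self-contained sketch of the same filter applied to synthetic arrays with the shapes described above (the array sizes and random values are purely illustrative, not the real dataset); it also shows how much of the training set the filter drops and how it shifts the training-target mean:

import numpy as np

# Illustrative shapes only: (num_videos, 144 frames, 7, 7) per video.
rng = np.random.default_rng(0)
X_train = rng.random((1000, 144, 7, 7))
Y_train = rng.random((1000, 144, 7, 7))
Y_val = rng.random((200, 144, 7, 7))

mean_Y_val = np.mean(Y_val)
mask_Y_train = np.mean(Y_train, axis=(1, 2, 3)) <= mean_Y_val
print(f"kept {mask_Y_train.sum()} of {mask_Y_train.size} training videos")
print(f"training-target mean before: {Y_train.mean():.3f}, "
      f"after: {Y_train[mask_Y_train].mean():.3f}")
X_train, Y_train = X_train[mask_Y_train], Y_train[mask_Y_train]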
– Marco
  • Hi, I edited the post to clarify the part you asked about. I meant that the loss and metrics improved noticeably from the first epoch onwards. Regarding the previous post, I took a look, but its focus seems to be a classification problem, whereas mine is a regression problem. Thanks. – Marco Mar 18 '24 at 17:54
  • @Marco if you artificially change the distribution of the training data, your model will not be a model of the operational data. It is not clear why the imbalance is a problem in a regression setting. Filtering the training set to match the validation set in some way seems bound to make the validation-set performance better if the metric is related to the change; it is less clear why the test performance improves. Are you confident that all three sets come from the same underlying distribution? – Dikran Marsupial Mar 20 '24 at 17:39
  • Thinking about it, if the means of the targets are substantially different in the training and validation sets, that suggests they are not from the same distribution? – Dikran Marsupial Mar 20 '24 at 17:40 [A quick way to compare the target distributions of the three splits is sketched after the comments.]
  • Hi @DikranMarsupial, yes, the three sets come from the same underlying distribution; it is a sequential time series, and I simply take a first sequential block for the training set, another sequential block for the validation set, and a final sequential block for the test set. There is an increase in performance on the validation set, and consequently on the test set, because the training distribution, which was previously not aligned with the validation and test sets, becomes aligned with them. Or am I wrong? Thank you very much for your help. – Marco Mar 20 '24 at 21:34
  • About the model not being a model of the operational data: I think you mean that the model will be biased after these alignments, and you are right, but what really matters at the moment is good prediction on unseen data (in this case, the test set). Every time I wanted to predict, I would need to retrain so that the training data is properly aligned with the latest validation and/or test data, to improve the model. Maybe I can just retrain the model, but I believe these operational issues are not important at the moment. – Marco Mar 20 '24 at 21:40
  • @DikranMarsupial so that you can better understand my time series data: they are Total Electron Content (TEC) maps, which are related to the solar cycle. The data range from 2008 to 2018, with the validation and test data towards the end: the validation data start in mid-2016 and the test data start at the beginning of 2018. If you look at the sunspot number progression plot, which tracks the solar cycle (https://www.swpc.noaa.gov/products/solar-cycle-progression), you will see that my test data lie close to solar minimum, which only the beginning of the training data resembles. – Marco Mar 20 '24 at 21:53
  • More info about the solar cycle: https://www.swpc.noaa.gov/phenomena/sunspotssolar-cycle. – Marco Mar 20 '24 at 21:54
  • I had also asked this question on the TensorFlow forum (https://discuss.tensorflow.org/t/solutions-to-downsampling-imbalanced-time-series-dataset-in-time-series-forecasting-for-regression-model/23427); there was an answer, but I wanted a more robust one, so I asked here on Cross Validated. – Marco Mar 20 '24 at 21:59
  • One more piece of information, although I don't know if it's relevant. I use sliding windows on the data; at each time step I move one TEC map forward. I also use gaps: I leave out 576 sequences (videos) between consecutive sets (between training and validation, and between validation and test, or, if I only use training and test sets, between training and test). – Marco Mar 20 '24 at 22:05 [A sketch of this kind of windowed, gapped split is given after the comments.]
  • @Marco, if you only have a couple of cycles you are not going to be able to produce a useful model as sunspot cycles have long term trends in them. Also your validation and test sets are not from the same distribution as the training set as they only cover the tail end of one cycle, but the training set covers the start and middle of the cycle. You need to build a model over multiple cycles and use whole cycles for the validation and test sets so they are all from the same distribution. – Dikran Marsupial Mar 21 '24 at 11:33
  • The reason it improves performance is simply because it biases the prediction towards low values, like you find in the tail ends of the cycle, but this is effectively "peeking" at the test set to make choices about the model, which invalidates the evaluation. – Dikran Marsupial Mar 21 '24 at 11:35
  • @DikranMarsupial So do you mean that I would have to have a larger amount of data so that the validation and test sets end up having the same distribution as the training set? Would I be able to adjust my data from 2008 to 2018 so that my training, validation and test sets have the same distribution? – Marco Mar 21 '24 at 13:47
  • @DikranMarsupial about "the reason it improves performance is simply because it biases the prediction towards low values, like you find in the tail ends of the cycle, but this is effectively "peeking" at the test set to make choices about the model, which invalidates the evaluation", why it invalidates the evaluation? I know it wouldn't be an operational model because the model only predicts, if it has good results, for a certain test set, but why isn't it an invalid model? What would you tweak in the data to improve performance? Thank you very much! – Marco Mar 21 '24 at 13:51
  • One more important note: I may not be able to obtain more data; how do I resolve the situation with the data I have? – Marco Mar 21 '24 at 13:59
  • @Marco it invalidates the test because you (the researcher) have already seen the validation and test sets and know the values in both are typically lower than the training set. So if you remove patterns with high values from the training set, you are leaking information from the validation and test sets into the design of the classifier. This is known as "researcher bias" and can be very difficult to quantify and deal with in research. – Dikran Marsupial Mar 21 '24 at 14:21
  • If you can't get data from earlier solar cycles, there probably isn't much you can do. You might be able to get some data from models (but you may as well use the model). You might also use e.g. sunspot numbers as a proxy (but you may not have enough data to convert sunspot numbers to the actual data that you need). Consider what you might predict for the validation and test data if you had only seen the training set and knew nothing about solar cycles (and had never seen the whole sunspot sequence), if there isn't enough data to characterise the system, predictions will be very uncertain. – Dikran Marsupial Mar 21 '24 at 14:25
  • @DikranMarsupial about the "researcher bias", I understand that it means the model will not generalize to arbitrary test data, as you already said when talking about the operational model. But suppose I want to predict rain for tomorrow, I have data since 1500, and I see that if I align my training data with data from the last few months (my test data) my prediction gets much better: what is the problem with doing that? My prediction objective will be satisfied, I will have a good prediction; isn't that enough? – Marco Mar 21 '24 at 15:04
  • In that case you would be making the change before seeing the validation data from tomorrow. In this case you are already aware of what the solar cycle was doing before making the change, so you would know beforehand that it would make the results better. Basically the test data should be used only to evaluate the model - there should be no means by which information about the test data can leak back into the design of the model. Consider having another "test set" which is taken from the top of the next solar cycle. The adjustment you have made will make it worse for the 2nd test set – Dikran Marsupial Mar 21 '24 at 15:09
  • @DikranMarsupial about getting more data, could you please elaborate on what you said about using sunspot number data as a proxy? – Marco Mar 21 '24 at 15:11
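
Regarding the question in the comments about whether the three splits really come from the same distribution, here is a minimal sketch of one way that could be checked, assuming Y_train, Y_val and Y_test are arrays of shape (num_videos, 144, 7, 7); the function names are illustrative, and because overlapping sliding windows violate the independence assumption of the Kolmogorov-Smirnov test, the p-values should be read as indicative only:

import numpy as np
from scipy.stats import ks_2samp

def per_video_means(Y):
    # Collapse each (144, 7, 7) target video to a single mean value.
    return np.mean(Y, axis=(1, 2, 3))

def compare_target_distributions(Y_train, Y_val, Y_test):
    # Summary statistics plus pairwise two-sample KS tests on the
    # per-video target means of each split.
    splits = {"train": per_video_means(Y_train),
              "val": per_video_means(Y_val),
              "test": per_video_means(Y_test)}
    for name, m in splits.items():
        print(f"{name}: mean={m.mean():.3f}, std={m.std():.3f}, "
              f"min={m.min():.3f}, max={m.max():.3f}")
    for a, b in [("train", "val"), ("train", "test"), ("val", "test")]:
        stat, p = ks_2samp(splits[a], splits[b])
        print(f"KS {a} vs {b}: statistic={stat:.3f}, p-value={p:.3g}")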
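
And regarding the sliding-window construction with 576-sequence gaps mentioned in the comments, a rough sketch of what such a split could look like, assuming maps is a chronologically ordered array of shape (num_timesteps, 7, 7); the function names, window handling and gap placement are assumptions, not the exact code used in the question:

import numpy as np

def make_windows(maps, window=144):
    # Slide one TEC map forward per step: X[t] is maps[t:t+window] and
    # Y[t] is the following window, maps[t+window:t+2*window].
    n = maps.shape[0] - 2 * window + 1
    X = np.stack([maps[t:t + window] for t in range(n)])
    Y = np.stack([maps[t + window:t + 2 * window] for t in range(n)])
    return X, Y

def split_with_gaps(X, Y, n_train, n_val, gap=576):
    # Chronological train/val/test split that discards `gap` consecutive
    # sequences between sets so overlapping windows do not straddle splits.
    train = slice(0, n_train)
    val = slice(n_train + gap, n_train + gap + n_val)
    test = slice(n_train + 2 * gap + n_val, None)
    return (X[train], Y[train]), (X[val], Y[val]), (X[test], Y[test])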

0 Answers