I am using a random forest regressor. When I split the data with shuffle=True I get a good R², but when I don't shuffle the data the score drops significantly. I am splitting the data as below:

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=rand, shuffle=True)

I fit the random forest regressor as below:

from sklearn.ensemble import RandomForestRegressor

reg = RandomForestRegressor()
reg.fit(X_train, y_train)

When I do cross-validation with unshuffled data the score is poor, but it improves when the data are shuffled.

This is how the response variable I am trying to predict looks:

[Plot: the response variable (total crop area) declining over time since the 1970s]

What could be the reason behind that?

lsr729

2 Answers


When the data aren't shuffled, train_test_split simply cuts the arrays into two contiguous pieces along the first axis. Observing a large difference in performance between the shuffled and unshuffled splits indicates that the order in which X, y are sorted is meaningful in some way.

As an example, imagine that your rows are sorted by one of the features, and this feature is important for prediction. The unshuffled test set (test_size=0.25) then contains only the largest values of that feature, values the model never saw during training, so the model can't predict well for them.
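To make this concrete, here's a minimal sketch with synthetic data (not OP's dataset): the rows are sorted by a single important feature, so the unshuffled test set contains only feature values outside the training range.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 10, 1000))        # rows sorted by the feature
y = 2.0 * x + rng.normal(0, 0.5, size=1000)  # feature is strongly predictive
X = x.reshape(-1, 1)

def split_score(shuffle):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, shuffle=shuffle, random_state=0)
    reg = RandomForestRegressor(random_state=0).fit(X_tr, y_tr)
    return r2_score(y_te, reg.predict(X_te))

print(split_score(shuffle=True))   # high R²: test values are interpolated
print(split_score(shuffle=False))  # poor R²: test set is the top 25% of x,
                                   # beyond anything in the training data
```

The shuffled split interpolates within the training range; the unshuffled split forces the forest to extrapolate, which it cannot do.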


In a comment, OP has clarified that the data are a time series (yearly data). So what's probably happened is that shuffling the data is letting the model use information from the future to predict the past, essentially giving the model precognition. To avoid optimistic bias, you should split the data so that you're always training the model on data from the past and evaluating it on data from the future.
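A minimal sketch of such a split (the helper name is illustrative, not a library function): keep the rows in time order and reserve the last fraction for testing.

```python
import numpy as np

def temporal_split(X, y, test_frac=0.25):
    """Train on the earliest rows, test on the latest; no shuffling.
    Assumes the rows of X and y are already sorted by time."""
    cut = int(len(X) * (1 - test_frac))
    return X[:cut], X[cut:], y[:cut], y[cut:]

X = np.arange(20).reshape(-1, 1)   # stand-in for 20 years of features
y = np.arange(20.0)
X_tr, X_te, y_tr, y_te = temporal_split(X, y)
print(len(X_tr), len(X_te))  # 15 5
```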

Sycorax
  • Thanks for the answer and links. I have a question about "what's probably happened is that shuffling the data is letting the model use information from the future to predict the past": why does the model care about the order of the data? I am not including time as an independent variable; I am just feeding the model some shuffled data. – lsr729 Mar 24 '22 at 20:44
  • Sure, but even without timestamps, the features vary with time, and knowing the future means you can predict the "gaps" in the time-series more precisely. – Sycorax Mar 24 '22 at 21:13
  • That's true, but how does the model know that the features depend on time? I believe the model should look only at the independent data provided to it in training. Is it perhaps because the response variable decreases with time (I have uploaded the plot in the question)? – lsr729 Mar 24 '22 at 21:34
  • 1
    It doesn't know about time, it just knows about things that are correlated with time, and that they're also correlated with the outcome. Random forest particularly is bad at extrapolating outside the range of the training data, so a strong trend in your target means it will be hard for the model to predict the future, but much easier to predict the past if you tell it the future. – Sycorax Mar 24 '22 at 21:50
  • I see. What other model would you suggest for this problem? – lsr729 Mar 24 '22 at 21:54
  • Random forest assumes that the data are iid realizations from some data-generating distribution, but your data are arranged in time. You should split your data based on time, as described in the linked threads, and you should use a [tag:time-series] model. – Sycorax Mar 24 '22 at 21:56
  • Thanks for the suggestion, I have put the question in a detailed manner here- https://stats.stackexchange.com/q/569020/275463. – lsr729 Mar 24 '22 at 22:05

In our conversation, you revealed two important facts not mentioned in your original question:

  • Your data is (implicitly) indexed by time as it consists of yearly measurements.
  • You are not interested in prediction at all but in estimation.

Large differences in performance with and without shuffling are a hint that the order of the data contains information. Often it's information about time, e.g., the order in which the data points occurred or were collected.

Understanding why/how can help you to a) avoid overestimating the predictive performance of the model and, sometimes, b) reconsider how you approach the problem altogether. Your problem seems to be in the latter category.

Don't use data from the future to predict the past.

A common situation where shuffling (before splitting into train/validation/test) is not appropriate is when the data is ordered in time. In that case you want to use the first portion of your data for training and the second portion for validation (the two parts don't need to be the same size). This correctly reflects the fact that the model won't have data from the future when it's deployed in real life.
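If you also want cross-validation rather than a single split, scikit-learn's TimeSeriesSplit implements this idea as an expanding window: each fold trains on an initial segment and validates on the segment that follows it. A small sketch:

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)   # 12 time-ordered observations
tscv = TimeSeriesSplit(n_splits=3)
for train_idx, test_idx in tscv.split(X):
    # Every validation index comes after every training index,
    # so the future never leaks into training.
    print(train_idx.max() < test_idx.min())  # True for every fold
```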

Since your data has a time index (years), it's not appropriate to shuffle before splitting, even if the results are worse; the unshuffled split is a more realistic, if disappointing, evaluation of model performance.

Estimation is not the same as prediction.

If you are interested in estimation, a random forest is not the right approach. You want to specify a flexible functional form for the relationship between features and response and then estimate its parameters. The model also has to take into consideration the serial nature of the observations. There is no need to split the data for training and validation; train on all the data and evaluate how well the model fits instead. Regression with correlated errors might be a good place to start.


Here are some additional thoughts that I leave here because your other question got closed down:

You plot total crop area against time, and the most striking pattern is that less land has been dedicated to crops since the 70s. Did any of the predictors drop or rise sharply during that period? Was there a large shift from agriculture to manufacturing? Important changes in policy? Do you have any data on economic activity other than agriculture? Should you even lump together data from before and after the ~1970s? Maybe not, especially if you don't have data on broader economic conditions apart from agriculture.

dipetkov
  • Thanks, I have posted the plot of my response variable. My confusion is that I am not doing time series analysis: I am providing some independent data, not a time component, and trying to predict a variable. Now, if the response variable is predictable from the independent data, the order in which the data are fed should not matter. – lsr729 Mar 24 '22 at 20:59
  • So here are the two cases you are considering: a) Train the model on data from 1955 to 2010 and check how well it does on 2011 to 2020. b) Give the model a random sample of years and test it on the rest. b) might very well mean that the model "saw" 2010 and 2012 and you ask it to predict 2011. Does b) make any sense to you? – dipetkov Mar 24 '22 at 21:14
  • That's true: b) gives high accuracy and a) gives very poor accuracy. But my understanding is that if I am providing independent parameters, then why is time important here? – lsr729 Mar 24 '22 at 21:21
  • I guess that makes sense as long as you are not planning to use your model for predicting 2022 because then the model is not going to know what happens in 2021 and 2023. – dipetkov Mar 24 '22 at 21:28
  • I am not making this model for the future prediction, but I would like to change the independent data and see how the response variable would have been if the X_1 is increased by 10% ... – lsr729 Mar 24 '22 at 21:32
  • This is a very different question from the one you actually asked, which describes neither the time series nature of your problem nor what you are hoping to model. – dipetkov Mar 24 '22 at 21:36
  • I am trying to analyze the response variable which is reducing historically. And for that, I have selected 5 explanatory variables that I think are responsible for the behavior of the response variable. I am not going to predict the response variable in the future but would like to do a scenario analysis. – lsr729 Mar 24 '22 at 21:38
  • 1
    I wasn't clear enough. If you'd like to ask that question, that's fine. Just do in a post of its own and provide all relevant context. – dipetkov Mar 24 '22 at 21:43
  • For some reason it's very upsetting that we got to the bottom of the time series nature of your problem because I asked the right question about it but you accepted the other answer. – dipetkov Mar 24 '22 at 22:40