
I'm working on a school project that involves performing backward stepwise regression as a form of feature selection. The dataset in question is 60k images with 700 total columns. It is much too large to perform backward selection on in either Python or R, and attempting it causes my computer to crash. Because of this, I wanted to try backward selection on random samples of the images. That said, I don't really know how to do this.

My thought was to take 100 samples of 1k images each, perform backward selection on each sample to identify the highest-performing models, and then generalize to the larger dataset. Does this make sense? Is there a better or more statistically sound way of doing this? How do I compare or generalize to the larger dataset? Also, do I need to worry that a sample of 1k images is dangerously close to the total number of columns in my dataset (~700)?
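To make the idea concrete, here is a minimal sketch of what I have in mind, not a tested pipeline. It assumes a plain least-squares fit as a stand-in for the actual classifier, uses residual sum of squares as the drop criterion, and records how often each column survives across subsamples. The names `rss`, `backward_select`, and `selection_frequency` are made up for illustration.

```python
import numpy as np


def rss(X, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    A = np.column_stack([np.ones(len(X)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return float(resid @ resid)


def backward_select(X, y, n_keep):
    """Start from all columns and repeatedly drop the one whose removal
    increases the RSS the least, until n_keep columns remain."""
    kept = list(range(X.shape[1]))
    while len(kept) > n_keep:
        scores = [rss(X[:, [c for c in kept if c != j]], y) for j in kept]
        kept.pop(int(np.argmin(scores)))
    return kept


def selection_frequency(X, y, n_subsamples, subsample_size, n_keep, seed=0):
    """Run backward selection on random row subsamples (drawn without
    replacement) and return the fraction of runs that kept each column."""
    rng = np.random.default_rng(seed)
    counts = np.zeros(X.shape[1])
    for _ in range(n_subsamples):
        rows = rng.choice(len(X), size=subsample_size, replace=False)
        counts[backward_select(X[rows], y[rows], n_keep)] += 1
    return counts / n_subsamples
```

Columns kept in most subsamples would be the candidates to refit once on the full 60k images and compare against the full model. Each elimination step here refits every candidate model from scratch, so the sketch is only practical for a modest number of columns.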

jmoore00
  • 369
  • 2
    Backward selection of what, the individual pixels? // Can you not fit the entire model to all 60k images? // Stepwise regression is usually a poor approach, so determining a statistically sound way of doing something that has an iffy statistical basis is challenging. – Dave Feb 11 '23 at 23:28
  • What are you trying to do, inference or prediction/classification? If the former, model selection invalidates all p-values (skim through these threads). – Stephan Kolassa Feb 12 '23 at 06:16
  • Backward selection of the individual pixels, correct. Basically a requirement of the project is to perform backward selection and then compare the classifier to that of the full model. This is for prediction/classification. I understand that stepwise regression is a poor choice, but I think my professor wants us to use it since it is covered in textbooks, similar to what one of the commenters in a post you've shared says. I think since this is for a project, the goal is really just to explore/investigate what happens when using this approach – jmoore00 Feb 13 '23 at 18:09
  • 2
    If all you need to do is satisfy the (not necessarily reasonable) request of the grader, then do whatever the grader wants you to do. // If you do not have the computational resources to do the backward stepwise regression, how do you fit the full model? The first step in backward stepwise regression is to fit the full model and pare down the variable count from there. – Dave Feb 13 '23 at 18:14
  • 1
    Correct -- I think the challenge is that the computation for stepwise regression is more memory intensive than fitting the full model, as it's fitting the full model, storing that in memory, then fitting p - 1 models, for each combination of features, and so on and so forth. This is too much for my computer and why I wanted to explore sampling. I'm also interested in sampling in general since it's an essential technique for big data and thus applicable in other areas as well. This is what led me to consider an approach that would make use of sampling. – jmoore00 Feb 13 '23 at 18:19
  • The additional RAM required by the stepwise algorithm should be minimal, because models are updated rather than computed anew in each step. Moreover, the size of the storage overall is proportional to the square of the number of variables and sampling the observations won't improve that. I am led to suspect something else is going on if you're having difficulties. Consider benchmarking the RAM usage of your code as a function of the number of variables (starting with smaller values) to see how it scales. – whuber Feb 16 '23 at 23:06
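The benchmarking suggested in the last comment could be sketched roughly as follows. This is an illustration under stated assumptions, not the poster's code: it times nothing, uses synthetic Gaussian data, fits by forming the p-by-p cross-product matrix explicitly, and relies on the standard-library `tracemalloc` module to report peak traced memory. The function names `peak_fit_bytes` and `benchmark` are hypothetical.

```python
import tracemalloc

import numpy as np


def peak_fit_bytes(n_rows, n_cols, seed=0):
    """Peak traced memory (bytes) while solving the normal equations
    for a synthetic n_rows x n_cols least-squares problem."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n_rows, n_cols))
    y = rng.normal(size=n_rows)

    tracemalloc.start()  # only allocations made after this call are traced
    gram = X.T @ X       # p x p matrix: storage grows with p squared
    coef = np.linalg.solve(gram, X.T @ y)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


def benchmark(col_counts, n_rows=2000):
    """Map each column count to the peak memory of one fit."""
    return {p: peak_fit_bytes(n_rows, p) for p in col_counts}


if __name__ == "__main__":
    for p, peak in benchmark([10, 40, 160]).items():
        print(f"{p:4d} columns: peak {peak / 1024:.1f} KiB")
```

Swapping the two fitting lines for the actual stepwise call and plotting peak memory against the number of columns should show whether usage grows roughly quadratically in the column count or much faster, which would point to something else in the code.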

0 Answers