
I have a computer science background but I am trying to learn how to apply ML by solving small problems.

I have been working on this problem for the last couple of days and I cannot find a solution. I have a dataset with just 10 samples (5 belong to class A and 5 to class B) and 30000 features. I can reduce the number of features to about 100, and I would like to use a random forest to identify the most important features among those 100.

I split the dataset into a training set and a test set (test_size=0.20, so the training set is even smaller than the original dataset). Unfortunately (and as expected), the model overfits. I tried tuning it with different parameters (max_depth, n_estimators, min_samples_leaf, criterion) using GridSearchCV. However, I still get 100% accuracy. Is there anything else I can try?
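To make the sizes concrete, here is a minimal sketch of the split described above. The feature matrix is a zero-filled placeholder and `random_state=0` is an assumption added only for reproducibility; neither comes from the real data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the shapes from the question: 10 samples, 100 features.
X = np.zeros((10, 100))
y = np.array([0] * 5 + [1] * 5)  # 5 samples of class A (0), 5 of class B (1)

# test_size=0.20 is from the question; random_state is an assumption.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, random_state=0
)

# With so few test samples, accuracy can only take a handful of values.
n_test = len(y_test)
possible_accuracies = [correct / n_test for correct in range(n_test + 1)]

print("training samples:", len(y_train))
print("test samples:", n_test)
print("possible test accuracies:", possible_accuracies)
```

With only two test samples, a classifier that guesses at random scores 100% about one time in four, so a perfect test score here carries very little information.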

Thank you in advance for your help

Dave
pingu87
    Your feature to observation ratio is hopeless even if the data were generated from a linear model with no noise when P is 30,000. It's still pretty tough when P is 100. – John Madden Oct 28 '22 at 17:10
    Since you are working on this as a learning exercise, another comment: a small dataset for model development is a big problem for data-driven methods. So actually you didn't choose a small problem. There are lots & lots of tutorials and examples online; I suggest you start with some of those. – dipetkov Oct 29 '22 at 13:00
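The point in the first comment about a noise-free linear model can be illustrated with a small sketch. Everything here is synthetic and assumed for illustration: the random features, the three nonzero coefficients in `beta_true`, and the use of the minimum-norm least-squares solution. With 10 samples and 100 features, the fitted coefficients reproduce the targets exactly yet are far from the coefficients that generated them.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples, n_features = 10, 100

# Synthetic features and a hypothetical "true" linear model with no noise.
X = rng.normal(size=(n_samples, n_features))
beta_true = np.zeros(n_features)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true

# With more features than samples, lstsq returns the minimum-norm solution
# among the infinitely many coefficient vectors that fit y exactly.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)

fit_error = np.max(np.abs(X @ beta_hat - y))        # how well y is reproduced
coef_error = np.linalg.norm(beta_hat - beta_true)   # distance to the truth

print("max fitting error:", fit_error)
print("distance from true coefficients:", coef_error)
```

A perfect fit therefore says nothing about whether the right features were found, and the situation is far worse with 30,000 features.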

1 Answer


Unfortunately, this problem is all but hopeless for a random forest approach. You do not have enough data to do much of anything, let alone fit a model as complex as the one you have selected.

If you only want to learn the mechanics of implementing a random forest in software, without the long training times that come with a large data set, then this is fine. However, you should not expect good predictions from your data, nor should you expect to identify important features with any confidence.
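As a rough illustration of why the feature ranking cannot be trusted, here is a minimal sketch on pure noise with the same shape as the reduced data (10 samples, 100 features). The data, the seeds, and `n_estimators=200` are assumptions for the demonstration, not the asker's settings. No feature carries any information about the labels, yet the forest still fits the training data and still produces a "top features" list.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import LeaveOneOut, cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 100))       # pure noise: no feature is informative
y = np.array([0] * 5 + [1] * 5)      # labels assigned independently of X

forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X, y)

# Accuracy on the same data the forest was trained on.
train_accuracy = forest.score(X, y)

# Leave-one-out accuracy: each sample is predicted by a forest that never saw it.
loo_scores = cross_val_score(
    RandomForestClassifier(n_estimators=200, random_state=0),
    X, y, cv=LeaveOneOut(),
)
loo_accuracy = loo_scores.mean()

# The forest still ranks features, although every feature is noise.
importances = forest.feature_importances_
top_features = np.argsort(importances)[::-1][:5]

print("training accuracy:", train_accuracy)
print("leave-one-out accuracy:", loo_accuracy)
print("'most important' noise features:", top_features)
```

Comparing the training accuracy with the leave-one-out accuracy shows how much of the apparent performance comes from memorizing ten points, and the features that come out on top are there only because of chance patterns in those ten points.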

How to know that your machine learning problem is hopeless

The first paragraph and image posted by Gung apply here (unless you just want to work on the software mechanics), even if that was posted in a rather different context.

Dave