
I understand the curse of dimensionality, and in machine learning at least I've heard that a minimum of 100-500 samples per class label is needed to train an algorithm effectively (leaving aside one-shot learning techniques still in development).

Is there a formula that gives guidance on an acceptable number of dimensions given the dataset size, the number of dimensions, and the number of levels in each dimension? I haven't found one, but it seems like that sort of formula should exist.

I could use such a formula to decide whether dimensionality reduction is needed, or to avoid it altogether by leaving out features that are uncorrelated with the target variable.

skeller88

2 Answers


It's mostly guesswork as far as I can tell, but here's a good place to start: estimate the total information content of your dataset (not the sample size!) and compare it to the information content of your model parameters. Your model parameters had better contain much less information if you want to avoid overfitting.
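The comparison above can be sketched numerically. The figures below (sample count, class count, parameter count, and the effective bits per parameter) are all made-up assumptions for illustration, not measured values; a real estimate of information content is much harder to pin down.

```python
import math

# Hypothetical numbers for illustration -- substitute your own.
n_samples = 5000                        # rows in the dataset
n_classes = 10                          # balanced labels assumed
bits_per_label = math.log2(n_classes)   # upper bound on label entropy

# A very rough bound on the label information the dataset supplies.
dataset_bits = n_samples * bits_per_label

# A model's capacity to memorize, crudely approximated by its
# parameter count times an assumed effective precision per weight.
n_params = 100_000
effective_bits_per_param = 4   # assumption; the true value is model-dependent
model_bits = n_params * effective_bits_per_param

# Heuristic: if the model can store far more bits than the data
# provides, overfitting is likely without strong regularization.
print(f"dataset ~{dataset_bits:.0f} bits, model ~{model_bits} bits")
print("model capacity exceeds data information:", model_bits > dataset_bits)
```

Under these made-up numbers the model's capacity dwarfs the data's information content, which is exactly the situation the answer warns about.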

  • This makes sense. Are there specific formulas that you typically use? Your answer led me to some helpful articles on information theory, but no specific libraries. – skeller88 Apr 02 '20 at 16:23
  • No, there are not specific formulas as far as I have ever been able to tell, unfortunately. – Dave Kielpinski Apr 02 '20 at 18:28

*This would be better suited as a comment, but I cannot do that yet.

What I like to use as a rule of thumb, without knowing anything domain-specific about the data, is to have at least as many examples as features. Of course this is not very helpful most of the time, but at least the model then has the potential to use every feature.
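This rule of thumb amounts to a one-line shape check before fitting. The matrix below is random data with invented dimensions, purely to show the check firing:

```python
import numpy as np

rng = np.random.default_rng(0)
# Made-up example: 150 examples but 200 features.
X = rng.normal(size=(150, 200))

n_samples, n_features = X.shape
underdetermined = n_samples < n_features
if underdetermined:
    # Fewer rows than columns: consider dropping uninformative features
    # or applying dimensionality reduction before fitting a model.
    print(f"warning: {n_samples} samples < {n_features} features")
```

With more features than examples, an unregularized linear model can fit the training data exactly, so the warning is worth heeding.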

Jurkis