
I have to plan a study in which I will need to build a classifier.

The output variable is binary, with an estimated proportion of value 1 of 0.10 in the overall population of interest (and therefore a proportion of value 0 of 0.90).
So it is an imbalanced sample.

I will have fewer than 40 features to include in the classifier, and I will try several algorithms such as SVM, random forest, CART, logistic regression, AdaBoost, ...

How can I calculate the smallest number of observations needed to maximize my classifier's performance?

Marion H.

1 Answer


There is no simple answer to your question. You will need to make some assumptions. If there are a few predictors that strongly predict the outcome, you will only need a small sample. If you need many predictors and can still only weakly improve on random guessing, you will need a large sample. If your predictors are not related to the outcome at all, your sample size does not matter.

Note that this is equivalent to the assumptions in power analysis for linear models, specifically the assumptions about effect sizes.
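
To see how strongly the assumed effect size drives the required sample size in the classical setting, here is a small power calculation, assuming Python with statsmodels; the Cohen's d values are arbitrary illustrative choices, not recommendations for your study.

```python
# Illustration: required sample size under different assumed effect sizes
# (Cohen's d of 0.2, 0.5, 0.8 are conventional "small/medium/large" values).
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
for d in [0.2, 0.5, 0.8]:
    n = analysis.solve_power(effect_size=d, alpha=0.05, power=0.8)
    print(f"effect size d={d}: about {n:.0f} observations per group")
```

The same logic carries over to your classification problem: the weaker the assumed signal, the more observations you need, and you cannot get around making that assumption somewhere.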

Your best bet is likely to make "reasonable" assumptions based on your subject matter knowledge, then simulate data, run your envisaged analysis on the simulated data, and check the outcome. Then systematically vary your assumptions.
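
As a concrete starting point, here is a minimal sketch of that simulation approach, assuming Python with scikit-learn; the number of informative predictors, their coefficients, the intercept, and the candidate sample sizes are all placeholder assumptions that you would replace with your own subject-matter guesses.

```python
# Sketch: simulate data under assumed effect sizes, fit a classifier,
# and see how performance changes with the sample size n.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, log_loss

rng = np.random.default_rng(42)

n_features = 40          # assumed number of candidate predictors
betas = np.zeros(n_features)
betas[:5] = 0.8          # assumption: 5 moderately strong predictors, the rest are noise
intercept = -3.0         # chosen so that P(y=1) is roughly 0.10

def simulate(n):
    """Simulate n observations under the assumed logistic model."""
    X = rng.normal(size=(n, n_features))
    p = 1.0 / (1.0 + np.exp(-(intercept + X @ betas)))
    y = rng.binomial(1, p)
    return X, y

for n in [200, 500, 1000, 2000, 5000]:
    aucs, losses = [], []
    for _ in range(20):  # repeat to average over simulation noise
        X, y = simulate(n)
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=0.3, stratify=y, random_state=0)
        model = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        prob = model.predict_proba(X_te)[:, 1]
        aucs.append(roc_auc_score(y_te, prob))
        losses.append(log_loss(y_te, prob))
    print(f"n={n:5d}  mean AUC={np.mean(aucs):.3f}  mean log loss={np.mean(losses):.3f}")
```

You would then look for the smallest n at which the performance curve flattens out under your assumptions, and repeat the exercise with weaker or stronger assumed coefficients (and with your other candidate algorithms) to see how sensitive that number is.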

Incidentally, don't use accuracy as your measure for classification performance.
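
Proper scoring rules such as the Brier score or the log loss evaluate the predicted probabilities directly instead of hard 0/1 labels, which matters particularly with a 10% prevalence. A tiny illustration, assuming scikit-learn, with made-up predicted probabilities:

```python
# Illustration: evaluate probabilistic predictions with proper scoring rules
# rather than accuracy. The probabilities below are made up for demonstration.
from sklearn.metrics import brier_score_loss, log_loss

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 0, 1]          # 10% prevalence, as in the question
prob_1 = [0.05, 0.10, 0.02, 0.20, 0.08, 0.15, 0.03, 0.12, 0.07, 0.60]

print("Brier score:", brier_score_loss(y_true, prob_1))   # lower is better
print("Log loss:   ", log_loss(y_true, prob_1))           # lower is better
```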

Stephan Kolassa