
I am trying to make an argument that if my field collected larger samples, we would be able to build better models with higher predictive accuracy. However, there's also the possibility that we are reaching an asymptote because the data are relatively noisy.

Is there an appropriate way to estimate how much more I could improve classification accuracy if I were to increase my sample size? I gave this a shot by taking non-overlapping random subsamples of my data for training and testing, using from 20% to 100% of the original sample, thereby simulating smaller samples.
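Roughly, the subsampling looked like the sketch below (a minimal illustration only, assuming the data are in NumPy arrays `X` and `y` and using a scikit-learn logistic regression as a stand-in for my actual classifier):

```python
# Hypothetical sketch of the subsampling procedure described above.
# Assumes features X and labels y are NumPy arrays; the logistic
# regression is just a placeholder for the real classifier.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
fractions = np.linspace(0.2, 1.0, 9)   # 20% to 100% of the original sample
n_repeats = 50                          # repeats to average out sampling noise

mean_accuracy = []
for frac in fractions:
    scores = []
    for _ in range(n_repeats):
        # draw a random subsample of the given size (without replacement)
        idx = rng.choice(len(y), size=int(frac * len(y)), replace=False)
        # non-overlapping train/test split within the subsample
        X_tr, X_te, y_tr, y_te = train_test_split(X[idx], y[idx], test_size=0.3)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        scores.append(clf.score(X_te, y_te))
    mean_accuracy.append(np.mean(scores))
# plotting mean_accuracy against fractions gives the learning curve
```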

From that, I found that accuracy does indeed increase a lot at first and then asymptotes, such that if I were to double my dataset, it would only buy me a small increase in accuracy.

Does anyone have thoughts on whether what I did was valid?

aleph4

1 Answer


To train your classifier, you are probably estimating a parameter $\hat \theta$, which is a function of your data $\mathbf X$. Then you apply some decision rule using $\hat \theta$. With infinite sample size, $\hat \theta$ should typically converge to some limit $\theta_0$. So if $\hat \theta$ is already "close" to $\theta_0$, then clearly more data cannot improve your classifier.

How can you tell if $\hat \theta$ is close to $\theta_0$? Well, if your estimation procedure is likelihood-based, you could evaluate the asymptotic standard errors on $\hat \theta$ using the Fisher Information. If not, you could bootstrap.
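For instance, a minimal sketch of the bootstrap route, again assuming the data are in arrays `X` and `y` and using a logistic regression's coefficient vector as a stand-in for $\hat \theta$ (the specific model is only an illustration):

```python
# Illustrative bootstrap of the standard errors on theta-hat.
# The logistic regression coefficients play the role of theta-hat here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_theta(X, y):
    """Return the estimated parameter vector for one dataset."""
    return LogisticRegression(max_iter=1000).fit(X, y).coef_.ravel()

rng = np.random.default_rng(0)
n = len(y)
boot_thetas = []
for _ in range(500):                      # bootstrap replicates
    idx = rng.integers(0, n, size=n)      # resample rows with replacement
    boot_thetas.append(fit_theta(X[idx], y[idx]))

theta_hat = fit_theta(X, y)
se = np.std(boot_thetas, axis=0, ddof=1)  # bootstrap standard errors
print(np.column_stack([theta_hat, se]))   # estimate next to its uncertainty
```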

This assumes you understand how your decision rule depends on $\hat \theta$. If you don't have a good feeling for how $\hat \theta$ affects the decision boundary, then evaluating the variance of the decision rule, or possibly of the mis-classification rate, through bootstrapping seems sensible. Note that with the latter you'd want to be careful about how your test sample is chosen, because different procedures have different interpretations (sample risk with a fixed sample, population risk with a varying sample). In any case, if the bootstrapped variance of the mis-classification rate is small compared to the absolute rate, then you have good evidence that sample size is not your problem.
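A rough sketch of that last check, with a fixed held-out test set (so the variability reflects re-estimating the classifier, i.e. the sample-risk interpretation above; the classifier is again just a placeholder):

```python
# Illustrative bootstrap of the mis-classification rate.
# The test set is held fixed, so the spread reflects re-estimation
# of the classifier from resampled training data.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rng = np.random.default_rng(0)
n = len(y_tr)
error_rates = []
for _ in range(500):
    idx = rng.integers(0, n, size=n)      # resample training rows with replacement
    clf = LogisticRegression(max_iter=1000).fit(X_tr[idx], y_tr[idx])
    error_rates.append(1.0 - clf.score(X_te, y_te))

rate = np.mean(error_rates)
spread = np.std(error_rates, ddof=1)
# If spread is small relative to rate, sample size is probably not the bottleneck.
print(f"error rate ~ {rate:.3f} +/- {spread:.3f}")
```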

Andrew M