
A question about whether a statistical significance test makes sense when you are testing an algorithm on the same data every time (i.e. not doing any sampling).

I have an algorithm that receives some data as input, and its output can be evaluated in terms of accuracy. There are many variations of this algorithm, and I'd like to know which one is best in terms of accuracy.

The only way I can currently test the algorithm is on some data that I collected from my users a while ago (1000 data points). This means the same 1000 data points will be used for evaluation every time I test a variation of my algorithm.

So given the following evaluation results, obtained using the same 1000 data points:

  • base algorithm: 71% accuracy
  • treatment algorithm: 73% accuracy

I'd like to know how certain I can be that the treatment algorithm is really better. Does a statistical significance test make sense here, given that I'm testing both the base and treatment algorithms on all the data that I have (i.e. I'm not doing any sampling)? To me it seems like my entire dataset can be treated as the entire population, and thus a statistical test makes little sense. If my assumption is incorrect and I should instead be doing statistical significance testing, what test would make sense in my case?
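For concreteness, this is how I could run a paired test if one turns out to be appropriate. It is a minimal sketch assuming McNemar's test (which I understand is a common choice for comparing two classifiers scored on the same examples); the per-example correctness vectors are hypothetical stand-ins for what I would compute from each algorithm's predictions:

```python
import numpy as np
from scipy.stats import binomtest  # the exact McNemar's test reduces to a binomial test

rng = np.random.default_rng(0)

# Hypothetical stand-ins: 0/1 correctness of each algorithm on the same 1000
# points. In practice these come from comparing predictions to true labels.
base_correct = rng.random(1000) < 0.71
treat_correct = rng.random(1000) < 0.73

# Both algorithms are scored on the same examples, so the outcomes are
# paired. McNemar's test looks only at the discordant points: those where
# exactly one of the two algorithms is correct.
b = int(np.sum(base_correct & ~treat_correct))  # base right, treatment wrong
c = int(np.sum(~base_correct & treat_correct))  # treatment right, base wrong

# Under H0 (equal accuracy), a discordant point is equally likely to favour
# either algorithm, so c ~ Binomial(b + c, 0.5).
result = binomtest(c, n=b + c, p=0.5, alternative="two-sided")
print(f"discordant pairs: b={b}, c={c}, p-value={result.pvalue:.3f}")
```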

  • I had a good read and it's indeed an interesting post, but I wasn't able to answer my question with it. I guess my question boils down to the fact that I sampled 1000 data points once, and now I'm testing all my algorithms on that very same dataset. So the question is: does the fact that I use the same data points for each validation test make my dataset the population, meaning a statistical test makes little sense? – wanttoaskstupidquestions Mar 04 '22 at 11:47
  • Why not test your algorithm on random samples of your dataset? This permits you to conceive of your dataset as a population (whose characteristics might be similar to future populations to which the algorithm will be applied) and enables you to compare the algorithm's results for any subsample to the properties of the held-out sample. This procedure, when applied in an automatic but principled way, is generally known as cross-validation -- and you can read a tremendous amount about it here on Cross Validated! – whuber Mar 04 '22 at 14:25 (a resampling sketch follows after this list)
  • If I understand you correctly, the data that you have are not the population, as you would like to make generalisations to other data. The question then is whether the current data set can be treated as an i.i.d. sample from the population you want to generalise to. If so, yes, you can test the null hypothesis that the accuracy of the two algorithms is the same on the data that you have. – Christian Hennig Oct 07 '22 at 00:20
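To illustrate whuber's resampling suggestion, here is a minimal sketch of a paired bootstrap over the 1000 points. The correctness vectors are again hypothetical stand-ins; a real run would derive them from each algorithm's predictions against the true labels:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 1000

# Hypothetical 0/1 correctness vectors on the same n points.
base_correct = rng.random(n) < 0.71
treat_correct = rng.random(n) < 0.73

# Nonparametric bootstrap: resample *indices*, so each replicate preserves
# the pairing between the two algorithms on every data point.
diffs = np.empty(10_000)
for i in range(diffs.size):
    idx = rng.integers(0, n, size=n)
    diffs[i] = treat_correct[idx].mean() - base_correct[idx].mean()

# If this interval excludes 0, the observed 2-point accuracy gap is unlikely
# to be an artifact of which 1000 points happened to be collected.
lo, hi = np.percentile(diffs, [2.5, 97.5])
print(f"95% bootstrap CI for the accuracy difference: [{lo:.3f}, {hi:.3f}]")
```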

0 Answers