
Many readers familiar with scientific articles in the field of deep learning-based computer vision might have observed a common practice: the absence of statistical significance tests for comparing the algorithms. Whether the focus is on classification, semantic segmentation, object detection, or other tasks, research papers typically present comparison tables showcasing state-of-the-art approaches versus their own, using various metrics like accuracy, IoU, F1-score, and more. Yet, the application of statistical tests to demonstrate the superiority of one method over another is conspicuously lacking. This raises several questions:

  1. Why is the use of statistical tests infrequent in these articles?
  2. What's the rationale behind reporting performance based solely on the results from the final training epoch?
  3. How can we be certain that the observed differences, sometimes of only a few percentage points, are statistically significant? (One possible check is sketched after this list.)
  4. Is the absence of statistical tests due to the time and cost associated with repeated model training?
  5. Is it assumed that having a substantial amount of data makes test performance reliable? If so, how much data is considered sufficient to trust the metrics?
  6. Can we estimate confidence intervals by examining the variability of an accuracy metric over time within a single training run, or is it necessary to rely on multiple training runs to obtain this information?
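To make questions 3 and 6 more concrete: one way to attach uncertainty to a single trained model's score, without retraining, is to bootstrap over the test set. The snippet below is only a minimal sketch of that idea; `correct_a` and `correct_b` are hypothetical stand-ins for the per-image 0/1 correctness of two models on the same test set (the values 0.82 and 0.79 are made up for illustration), and the approach captures test-set sampling variability only, not run-to-run variance.

```python
import numpy as np

# Hypothetical per-example correctness (1 = correct, 0 = wrong) of two models
# evaluated on the same test set; in practice these come from comparing each
# model's predictions with the ground-truth labels.
rng = np.random.default_rng(0)
n_test = 500
correct_a = rng.binomial(1, 0.82, size=n_test)  # made-up accuracy around 0.82
correct_b = rng.binomial(1, 0.79, size=n_test)  # made-up accuracy around 0.79


def paired_bootstrap_ci(a, b, n_boot=10_000, alpha=0.05, seed=1):
    """Percentile-bootstrap CI for the accuracy difference acc(a) - acc(b),
    resampling test examples jointly so the pairing between models is kept."""
    rng = np.random.default_rng(seed)
    n = len(a)
    diffs = np.empty(n_boot)
    for i in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample test indices with replacement
        diffs[i] = a[idx].mean() - b[idx].mean()
    lower, upper = np.quantile(diffs, [alpha / 2, 1 - alpha / 2])
    return a.mean() - b.mean(), (lower, upper)


diff, (lower, upper) = paired_bootstrap_ci(correct_a, correct_b)
print(f"accuracy difference: {diff:.3f}, 95% CI: [{lower:.3f}, {upper:.3f}]")
```

If the interval excludes zero, the gap is unlikely to be an artifact of which images happened to land in the test set; a classical alternative on a single test set is McNemar's test on the two models' agreement/disagreement counts. Variability due to random initialization, data ordering, or augmentation can only be seen by training each model several times with different seeds, which is exactly the costly part alluded to in question 4.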
ricber
  • For the question in the title of your post, uncharitable suspicions come to mind. And likely, nobody but the authors of the papers in question can answer; authors who, I suspect, are not regular visitors here. However, the other questions in your post are quite different from the question in your title, especially questions 2, 5 and 6. Perhaps you want to reduce your scope, and potentially ask additional questions? – Stephan Kolassa Oct 30 '23 at 16:31
  • @StephanKolassa The practice is so common that I suspect there is an obvious motivation that I am missing. I think there is no need for the authors of those studies to respond, which is why I am asking here. – ricber Oct 30 '23 at 16:47
  • @StephanKolassa As regards the other questions, everything starts from the one in the title. I hope you can follow my line of reasoning; admittedly, it may have led me away from the title question, but I believe all the questions are related to each other. Would you suggest that I rewrite the title of this question or split it into several questions? – ricber Oct 30 '23 at 16:50
  • I have to admit that I doubt you will get a good answer to your titular question. I would recommend you split your post into different questions, and you may get something for the others. – Stephan Kolassa Oct 30 '23 at 16:55
  • I suspect that some of it is an assumption that the data sets are so huge that the standard errors will be close to zero (whether that assumption is reasonable or not). – Dave Oct 30 '23 at 17:17
  • Related: https://stats.stackexchange.com/questions/550308/why-does-machine-learning-have-metrics-such-as-accuracy-precision-or-recall-to/550313#550313 – Sycorax Oct 30 '23 at 17:25
  • I think this is a duplicate of the link posted by Sycorax. @ricber does that answer your questions? – Dave Oct 30 '23 at 17:35
  • @Dave Yes, I have the same suspicion. However, I see this practice used even when datasets are neither small nor huge (like hundreds of images). So, I was wondering if there is a rule of thumb for what counts as a huge dataset (a rough calculation is sketched below). Anyway, yes, the link posted by Sycorax answers part of my questions. I couldn't find it via search, nor did it appear in the related questions. I will ask the remaining unanswered questions in a separate post. – ricber Oct 30 '23 at 17:43
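To put a rough number on the "huge dataset" intuition from the comments: under a simple binomial model, the standard error of a single accuracy estimate is sqrt(p(1-p)/n). The sketch below just tabulates this for a few test-set sizes; the accuracy p = 0.85 is an arbitrary illustrative assumption, and the calculation ignores correlation between the compared models, so it only shows how precisely one accuracy can be measured, not whether a difference is significant.

```python
import math

p = 0.85  # assumed test accuracy, purely illustrative
for n in (100, 1_000, 10_000, 100_000):
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error of the accuracy estimate
    print(f"n={n:>7}: SE ≈ {se:.4f}  (95% margin ≈ ±{1.96 * se:.3f})")
```

With a few hundred test images the 95% margin is several percentage points, i.e. comparable to many reported gaps, whereas with tens of thousands of images it shrinks to well under one percentage point.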

0 Answers