
I do not clearly understand why models are compared across multiple datasets. What practical problem does this aim to solve? In particular, several papers compare models using datasets that are not related to the same topic. Is this approach used in real industrial applications?

Some papers:

Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. Journal of Machine Learning Research, 7, 1–30.

García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information Sciences, 180(10), 2044–2064.

García, S., & Herrera, F. (2008). An extension on "Statistical comparisons of classifiers over multiple data sets" for all pairwise comparisons. Journal of Machine Learning Research, 9(12).

  • Welcome to Cross Validated! Could you please give some citations for papers that do this? – Dave Oct 16 '23 at 14:26
  • Thanks! Some papers: Statistical Comparisons of Classifiers over Multiple Data Sets (Demsar 2006), An Extension on “Statistical Comparisons of Classifiers over Multiple Data Sets” for all Pairwise Comparisons (Garcia and Herrera, 2008), Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power (Garcia etc. 2010) – a_burkley Oct 16 '23 at 14:51
  • Could you add the full citations, such as from Google Scholar, as an edit to your original question? I'm thinking of something like what I did at the end of my answer here. – Dave Oct 16 '23 at 14:53
  • Sure, I have edited it as you have said. – a_burkley Oct 16 '23 at 15:06

1 Answer


When we study a classification, prediction, forecasting or any other method, we are typically less interested in whether it works well on one specific dataset. Rather, we want to understand whether it works well on new datasets.

If you buy a new car, you do not only care about how it handles on the dealer's yard, because you presumably want to drive it elsewhere, too. So you take it on a test drive. You will not be able to test drive it over all roads you will ever drive it on, but you will at least try driving it under different conditions and on different roads, in order to learn whether your candidate car "generalizes well".

And yes, this is absolutely used in industrial applications. I forecast retail sales for a living. When we consider tweaking our algorithms, I try very hard to get multiple datasets to try the proposed tweak on before I unleash it on all customers out there.

EDIT in response to an edit in the question: of course this only makes sense for datasets which are "appropriate" to the question. Whether a given dataset fulfills that condition depends on the particular situation. If your task is "classification of animals", you should apply your method to pictures of dogs, cats and fish. If your task is "classification of dog breeds", the cats and fish are less interesting. (You should still be interested in what your method does with irrelevant data, though.)
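To make the "compare across multiple datasets" idea concrete: the papers cited in the question recommend nonparametric tests such as the Friedman test, which ranks each algorithm's performance per dataset and checks whether the rank differences are larger than chance. Below is a minimal sketch using SciPy with made-up accuracy numbers (the classifier names and values are purely illustrative, not from any of the cited papers).

```python
# Minimal sketch of a Demšar-style comparison: the Friedman test on the
# (hypothetical) accuracies of three classifiers over five datasets.
from scipy.stats import friedmanchisquare

# Each list holds one classifier's accuracy on the same five datasets.
acc_A = [0.85, 0.80, 0.91, 0.78, 0.88]
acc_B = [0.82, 0.79, 0.88, 0.75, 0.84]
acc_C = [0.86, 0.83, 0.92, 0.80, 0.89]

# The Friedman test ranks the classifiers within each dataset, so it
# compares relative performance without assuming comparable scales
# across datasets.
stat, p = friedmanchisquare(acc_A, acc_B, acc_C)
print(f"Friedman chi-square = {stat:.3f}, p = {p:.4f}")

# If p is small, a post-hoc test (e.g. Nemenyi, as in Demšar 2006)
# identifies which pairs of classifiers actually differ.
```

Because the test operates on within-dataset ranks, it does not matter that one dataset is "easy" and another "hard"; only the consistent ordering of the methods across datasets drives the result.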

Stephan Kolassa
  • Thanks for the answer. But, when we compare two or more classification algorithms using statistical analysis on one dataset, we can conclude that this outcome can be generalized to the population, can't we? So, why do we compare these algorithms across multiple datasets related to different domains? – a_burkley Oct 18 '23 at 12:11
  • We can conclude (more or less) that we can extend to the population that our dataset is representative of. But the population of dog pictures may be quite different from the population of animal pictures. And datasets' representativeness is a very thorny issue: how do you know whether a dataset is truly representative of a population? See this recent thread: https://stats.stackexchange.com/q/628727/1352 – Stephan Kolassa Oct 18 '23 at 14:20