I cannot clearly understand why models are compared across multiple datasets. What practical problems do they aim to solve? Especially, in several papers, models are compared using datasets that are not related to the same topic. Is this approach used in real industrial applications?
Some papers:
Demšar, J. (2006). Statistical comparisons of classifiers over multiple data sets. The Journal of Machine learning research, 7, 1-30.
García, S., Fernández, A., Luengo, J., & Herrera, F. (2010). Advanced nonparametric tests for multiple comparisons in the design of experiments in computational intelligence and data mining: Experimental analysis of power. Information sciences, 180(10), 2044-2064.
Garcia, S., & Herrera, F. (2008). An Extension on" Statistical Comparisons of Classifiers over Multiple Data Sets" for all Pairwise Comparisons. Journal of machine learning research, 9(12).