For a university paper I want to test the hypothesis that one particular SMOTE variant outperforms two other SMOTE variants. By 'outperforms' I mean achieving a higher F1 measure.
I want to test this using multiple datasets (let's say 20 datasets) with varying degrees of class imbalance, in case my SMOTE variant's performance varies with imbalance ratio.
I also want to use three separate classifiers, let's say Decision Trees, Naive Bayes and Random Forest, in case my SMOTE variant outperforms the other two variants for one type of classifier but not for the others.
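For concreteness, here is a minimal sketch of the experimental grid I have in mind, using scikit-learn and imbalanced-learn. The dataset dictionary is a placeholder, and `BorderlineSMOTE` / `SVMSMOTE` merely stand in for the two competing variants:

```python
from imblearn.over_sampling import SMOTE, BorderlineSMOTE, SVMSMOTE
from imblearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

datasets = {}  # name -> (X, y); in practice, 20 imbalanced datasets

variants = {
    "mine": SMOTE(random_state=0),             # placeholder for my variant
    "other_a": BorderlineSMOTE(random_state=0),
    "other_b": SVMSMOTE(random_state=0),
}
classifiers = {
    "dt": DecisionTreeClassifier(random_state=0),
    "nb": GaussianNB(),
    "rf": RandomForestClassifier(random_state=0),
}

# scores[classifier][variant] -> list of mean F1 values, one per dataset
scores = {c: {v: [] for v in variants} for c in classifiers}
for ds_name, (X, y) in datasets.items():
    for c_name, clf in classifiers.items():
        for v_name, smote in variants.items():
            # oversample inside the pipeline so SMOTE is fitted only on
            # the training folds, never on the held-out fold
            pipe = Pipeline([("smote", smote), ("clf", clf)])
            f1 = cross_val_score(pipe, X, y, cv=5, scoring="f1").mean()
            scores[c_name][v_name].append(f1)
```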
As part of this work I will need to test my hypothesis for statistical significance. I'm unsure, though, how best to design the experiment.
From some background reading, I think that if I were using just a single classifier, I could use a Friedman test to check for significant differences between the F1 measures of the SMOTE variants, with the datasets acting as blocks. If a significant p-value were found, I could then apply a Holm post-hoc procedure to identify which pairwise differences are significant.
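If it helps to clarify what I'm describing, here is a sketch of that single-classifier analysis, assuming `f1_scores` is a 20 × 3 array (datasets × variants) of F1 values. I've implemented the post-hoc step as pairwise Wilcoxon signed-rank tests with a Holm correction, which I understand to be one common choice; the data below is a random placeholder:

```python
from itertools import combinations
import numpy as np
from scipy.stats import friedmanchisquare, wilcoxon
from statsmodels.stats.multitest import multipletests

rng = np.random.default_rng(0)
f1_scores = rng.uniform(0.5, 0.9, size=(20, 3))  # placeholder: 20 datasets x 3 variants

# Friedman test across the three variants, with datasets as blocks
stat, p = friedmanchisquare(f1_scores[:, 0], f1_scores[:, 1], f1_scores[:, 2])
print(f"Friedman: chi2={stat:.3f}, p={p:.4f}")

if p < 0.05:
    # post hoc: pairwise Wilcoxon signed-rank tests, Holm-corrected
    pairs = list(combinations(range(3), 2))
    raw_p = [wilcoxon(f1_scores[:, i], f1_scores[:, j]).pvalue for i, j in pairs]
    reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="holm")
    for (i, j), pv, rej in zip(pairs, adj_p, reject):
        print(f"variant {i} vs variant {j}: Holm-adjusted p={pv:.4f}, significant={rej}")
```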
If I use three separate classifiers, is it possible to incorporate them into a single Friedman test (and if so, how)?
If not, would it be a valid approach to perform three separate Friedman tests, one for each classifier? And if so, is there a 'correct' way to evaluate the outcomes of the three tests as a whole, in order to conclude that my SMOTE variant outperforms the other two on the chosen metric?
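To make this second question concrete, the following is what I mean by three separate tests, one per classifier (again with random placeholder data; `"dt"`, `"nb"`, `"rf"` are just labels for the three classifiers):

```python
import numpy as np
from scipy.stats import friedmanchisquare

rng = np.random.default_rng(1)
# placeholder: for each classifier, a (20 datasets x 3 variants) array of F1 values
f1_by_clf = {c: rng.uniform(0.5, 0.9, size=(20, 3)) for c in ("dt", "nb", "rf")}

for c_name, f1 in f1_by_clf.items():
    stat, p = friedmanchisquare(f1[:, 0], f1[:, 1], f1[:, 2])
    print(f"{c_name}: Friedman chi2={stat:.3f}, p={p:.4f}")
```

(For example, would I then need to apply a multiple-comparisons correction across these three p-values, or is there a more principled way to combine them?)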
I hope I have asked my questions clearly, but this is my first time attempting this kind of experiment, so please be kind!