
From a set of $N$ algorithms I want to determine which perform statistically better than others (for some metric), and communicate effectively how much better they perform than one another. The aim is to rank the algorithms and select the best one.

1 (significance test): With the Mann-Whitney U rank test I determine which pairs of algorithms are statistically significantly different. I use a non-parametric test since I do not know anything about the distribution of the metric, and it seems more robust than an ANOVA (see Pair-wise Mann-Whitney U vs. ANOVA).
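For illustration, a minimal sketch of this step in Python (the algorithm names and metric values are placeholders, and no multiple-comparison correction is applied here):

```python
from itertools import combinations
from scipy.stats import mannwhitneyu

# Placeholder metric values per algorithm (purely illustrative).
results = {
    "A": [1.4, 1.5, 1.6, 1.7],
    "B": [3.0, 2.5, 2.4, 0.9],
    "C": [1.1, 1.2, 1.3, 1.0],
}

alpha = 0.05
significant_pairs = []
for (name_a, x_a), (name_b, x_b) in combinations(results.items(), 2):
    # Two-sided Mann-Whitney U test for each pair of algorithms.
    _, p = mannwhitneyu(x_a, x_b, alternative="two-sided")
    if p < alpha:
        significant_pairs.append((name_a, name_b, p))

print(significant_pairs)
```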

2 (effect size): For those pairs that were significantly different, I can compute the Vargha-Delaney A (VDA) measure to report the effect size for each pair.
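A small helper for this step, assuming the VDA estimate is obtained by counting wins over all pairs of observations, with ties counting half (the function name is mine):

```python
def vda(x_a, x_b):
    """Estimate Vargha-Delaney A: P(X_A > X_B) + 0.5 * P(X_A = X_B)."""
    wins = sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x_a for b in x_b)
    return wins / (len(x_a) * len(x_b))

# Example: A against C from the data above.
print(vda([1.4, 1.5, 1.6, 1.7], [1.1, 1.2, 1.3, 1.0]))  # 1.0
```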

Problem: The above gives me a matrix of effect sizes (blank for those comparisons that were not statistically significantly different), which is not ideal for communicating results, especially if there is such a matrix for every dataset.

Question: Suppose I have a simple baseline algorithm: can I compute the VDA of each of the $N$ algorithms against this baseline and then sort the algorithms by effect size? This would be an easier way of communicating the results than a matrix. If not, is there another measure for which we can do this?

In the original paper [1] the A measure is defined as $A_{AB} = P(X_A > X_B) + \frac{1}{2}P(X_A=X_B)$, where $X_A$ and $X_B$ are the performance of algorithms $A$ and $B$ respectively. I don't see why $A_{AC} > A_{BC}$ would imply $A_{AB}>0.5$; if it does not, the answer to my question is no.

Counterexample: In fact, I think I can give a counterexample. Consider the following metric values obtained for algorithms $A$, $B$, $C$, using $C$ as the baseline: $X_A=\{1.4,1.5,1.6,1.7\}$, $X_B=\{3.0,2.5,2.4,0.9\}$, $X_C=\{1.1,1.2,1.3,1.0\}$. We can see that $\hat{A}_{AC}=1.0$ and $\hat{A}_{BC}=0.75$. However, $\hat{A}_{AB}=0.25$, which is smaller than $0.5$.
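As a quick sanity check (not part of the argument itself), these numbers can be reproduced with a few lines of Python:

```python
def a_hat(x, y):
    # Empirical A measure: wins over all pairs, ties count half
    # (same helper as sketched above, repeated here for completeness).
    return sum(1.0 if a > b else 0.5 if a == b else 0.0
               for a in x for b in y) / (len(x) * len(y))

x_a = [1.4, 1.5, 1.6, 1.7]
x_b = [3.0, 2.5, 2.4, 0.9]
x_c = [1.1, 1.2, 1.3, 1.0]

print(a_hat(x_a, x_c))  # 1.0
print(a_hat(x_b, x_c))  # 0.75
print(a_hat(x_a, x_b))  # 0.25, i.e. below 0.5 even though A_AC > A_BC
```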

But are there conditions under which this implication does hold, e.g., when the metric values are normally distributed? Or is there a different measure for which $M(A,C)>M(B,C)$ implies $M(A,B)>0.5$, so that I can better communicate a ranking of these algorithms?

  • In general, VDA or Wilcoxon-Mann-Whitney don't follow a transitive principle. I wrote an example here @ the Schwenk Dice phenomenon, where essentially A>B, B>C, but C>A. rcompanion link ... I don't think this phenomenon underlies your question, however. In your example, B tends to have the highest values, so M(A,B) or M(C,B) will be < 0.50. The fact that M(A,C) is > M(B,C) logically doesn't affect this. ... You could rank your competing algorithms against a baseline algorithm, as you suggest, perhaps including a confidence interval. – Sal Mangiafico May 06 '22 at 17:03
  • any reason you don't just want to say how much better the #1 algo is than the #2? Or how much better #1 algo is than the mean of other algos? – John Madden May 06 '22 at 17:04
  • P.S. Search for Schwenk at the previous link, if interested. – Sal Mangiafico May 06 '22 at 17:04
  • @SalMangiafico Thank you for the reply. I have browsed through the link you gave. Ranking + confidence interval seems like a good approach too. From the website, am I correct that it would be better to do a Kruskal–Wallis test, and if significant, a Dunn test to find which algorithms differ from each other (I will need to adjust the p-values)? Is this better than using the pair-wise Mann-Whitney U rank test? – Richard Schoonhoven May 06 '22 at 17:12
  • @JohnMadden This is indeed what I ultimately want to say, but in a way that is statistically sound. I do not want to falsely claim that algorithm A is the best when there is in fact no statistical difference. Secondly, when there is a difference, there is the question of reporting mean/median. Also this may change depending on the metric. It could be runtime of the algorithm in one case, but "fraction of optimal fitness" found in a second case (which is a proportion). – Richard Schoonhoven May 06 '22 at 17:17
  • @JohnMadden This is why I am looking into effect size measures. I hope to find a robust measure that allows us to draw these conclusions reliably irrespective of the type of metric. My question then relates to not wanting to create big NxN matrices for every dataset, hence a transitive property would be very useful. – Richard Schoonhoven May 06 '22 at 17:18
  • Yes, I would use Dunn 1964. The advantage over pairwise Mann-Whitney tests is that the Dunn retains all the ranks from the full data set. ... If you have relatively similar sample sizes, the results of the pairwise Dunn test can be used to rank the algorithms. The effect size statistics could be presented also. ... Yes, I think that p-values are usually adjusted with the Dunn test, but what method, if any, you want to use depends on your purpose. – Sal Mangiafico May 06 '22 at 20:01
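As a rough sketch of the Kruskal–Wallis + Dunn workflow discussed in these comments (this is my own illustration, assuming the third-party scikit-posthocs package is available; the data and the Holm adjustment are placeholders):

```python
from scipy.stats import kruskal
import scikit_posthocs as sp

# Placeholder metric values per algorithm.
results = {
    "A": [1.4, 1.5, 1.6, 1.7],
    "B": [3.0, 2.5, 2.4, 0.9],
    "C": [1.1, 1.2, 1.3, 1.0],
}

groups = list(results.values())
stat, p = kruskal(*groups)  # omnibus test across all algorithms
print(f"Kruskal-Wallis p = {p:.3f}")

if p < 0.05:
    # Pairwise Dunn test with Holm-adjusted p-values;
    # returns a DataFrame of adjusted p-values for each pair of groups.
    dunn_p = sp.posthoc_dunn(groups, p_adjust="holm")
    print(dunn_p)
```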

0 Answers