Given a set of $N$ algorithms, I want to determine which perform statistically significantly better than others (for some metric) and to communicate effectively how much better they perform than one another. The aim is to rank them or select the best one.
1 (significance test): With the Mann-Whitney U test I determine which pairs are statistically significantly different. I use a non-parametric test since I do not know anything about the distribution of the metric, and it seems more robust than an ANOVA (see "Pair-wise Mann-Whitney U vs. ANOVA").
2 (effect size): For those pairs that were significantly different, I can compute the Vargha-Delaney A (VDA) measure to report the effect size for each pair.
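To make the link between steps 1 and 2 concrete: the A estimate is the Mann-Whitney U statistic of the first sample divided by the number of pairs, so both steps rest on the same rank sum. Below is a minimal sketch of that relationship; the function names `midranks` and `vda_from_ranks` are my own, purely for illustration, and not from any library.

```python
def midranks(values):
    """Return 1-based ranks of `values`, averaging ranks within ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        # Extend j to the end of the run of values tied with position i.
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def vda_from_ranks(x, y):
    """Estimate A_xy = P(X > Y) + 0.5 * P(X = Y) via the rank sum of x."""
    m, n = len(x), len(y)
    ranks = midranks(list(x) + list(y))
    r1 = sum(ranks[:m])           # rank sum of the first sample
    u1 = r1 - m * (m + 1) / 2     # Mann-Whitney U statistic for x
    return u1 / (m * n)           # rescale U to [0, 1]


if __name__ == "__main__":
    print(vda_from_ranks([1, 2], [2, 3]))  # 0.125: one tie out of four pairs
```

A value of 0.5 means neither sample tends to be larger; values near 0 or 1 indicate a strong effect in one direction or the other.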
Problem: The above would give me a matrix of effect sizes (blank for those comparisons that were not statistically significantly different) which is not ideal for communicating results, especially if we have such a matrix for every dataset.
Question: Suppose I have a simple baseline algorithm. Can I compute the VDA of each of the $N$ algorithms against this baseline and then sort the algorithms by effect size? This would be an easier way of communicating the results than a matrix. If not, is there another measure for which we can do this?
In the original paper [1] the A measure is defined as $A_{AB} = P(X_A > X_B) + \frac{1}{2}P(X_A=X_B)$, where $X_A$ and $X_B$ would be the performance of algorithms $A$ and $B$ respectively. I don't see why $A_{AC} > A_{BC}$ would imply $A_{AB}>0.5$, which would mean that the answer is no.
Counterexample: I think I can even give a simple counterexample. Consider the following metric values obtained for algorithms $A$, $B$ and $C$, with $C$ as the baseline: $X_A=\{1.4,1.5,1.6,1.7\}$, $X_B=\{3.0,2.5,2.4,0.9\}$, $X_C=\{1.1,1.2,1.3,1.0\}$. We can see that $\hat{A}_{AC}=1.0$ and $\hat{A}_{BC}=0.75$, so sorting by effect size against the baseline would rank $A$ above $B$. However, the direct comparison gives $\hat{A}_{AB}=0.25$, which is smaller than $0.5$.
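The numbers in this counterexample can be checked by counting pairs directly from the definition of the A measure. The following is a small sketch for that purpose; the helper `vda` is a hypothetical name of my own, and the data are the three samples given above.

```python
from itertools import product


def vda(x, y):
    """Estimate A_xy = P(X > Y) + 0.5 * P(X = Y) over all (x, y) pairs."""
    wins = sum(a > b for a, b in product(x, y))
    ties = sum(a == b for a, b in product(x, y))
    return (wins + 0.5 * ties) / (len(x) * len(y))


X_A = [1.4, 1.5, 1.6, 1.7]
X_B = [3.0, 2.5, 2.4, 0.9]
X_C = [1.1, 1.2, 1.3, 1.0]  # baseline

a_ac = vda(X_A, X_C)  # A against the baseline
a_bc = vda(X_B, X_C)  # B against the baseline
a_ab = vda(X_A, X_B)  # A against B directly

# Ranking obtained by sorting on effect size against the baseline.
baseline_scores = {"A": a_ac, "B": a_bc}
ranking = sorted(baseline_scores, key=baseline_scores.get, reverse=True)

print(a_ac, a_bc, a_ab)  # 1.0 0.75 0.25
print(ranking)           # ['A', 'B'], although a_ab < 0.5
```

The baseline ranking puts $A$ first, while the direct comparison favours $B$ in 12 of the 16 pairs.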
But are there conditions under which this implication does hold, e.g., when the metrics are normally distributed? Is there a different measure for which $M(A,C)>M(B,C)$ implies $M(A,B)>0.5$, so that I can better communicate a ranking for these algorithms?