8

Suppose I have $N$ different items to be ranked, and $M$ different people who provide a ranking of the items according to their personal preference. What is the best way of combining all rankings into one?

The simplest approach would be to compute the sum of all ranks, and then rank the results from lowest to highest. But there are other ways as well. Is there a method which is more mathematically correct, possibly by assuming some underlying distribution?

For instance, let $r_{ij}$ be the rank assigned to item $i$ by evaluator $j$. The simple approach outlined above is equivalent to converting each rank to a score $s_{ij} = -r_{ij}$ and computing for each item a total score $S_i = \sum_{j=1}^M{s_{ij}}$, which is used to produce the final ranking of the items. Other alternatives include setting the score to $s_{ij}=1/r_{ij}$ (harmonic mean) or $s_{ij}=-\log r_{ij}$ (geometric mean). How can I choose which alternative is best?
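
For concreteness, here is a small numerical sketch of the three scoring schemes (toy data of my own; Python with NumPy assumed):

```python
import numpy as np

# r[i, j]: rank assigned to item i by evaluator j (1 = best); toy data
r = np.array([[1, 2, 1],
              [2, 1, 3],
              [3, 3, 2]])

S_sum  = (-r).sum(axis=1)           # s_ij = -r_ij     (rank sum)
S_harm = (1.0 / r).sum(axis=1)      # s_ij = 1/r_ij    (harmonic flavour)
S_log  = (-np.log(r)).sum(axis=1)   # s_ij = -log r_ij (geometric flavour)

# Final ranking under each scheme: higher total score = better rank.
for name, S in [("sum", S_sum), ("harmonic", S_harm), ("log", S_log)]:
    print(name, list(np.argsort(-S)))  # item indices, best first
```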

akvilas
  • Here is an example of how The Guardian ranks the best football players: https://www.theguardian.com/football/2023/dec/18/how-the-guardian-ranked-the-100-best-male-footballers-in-the-world-2023 – Max Mar 03 '24 at 22:14
  • 2
    "Combining" could mean a huge number of different things and the choice of any of them must depend on why you are making this combination and how you intend to interpret it. Please explain. – whuber Mar 04 '24 at 13:55
  • 2
    In what sense do you mean "best way"? What are you trying to optimize? – Galen Mar 04 '24 at 21:04
  • IMO this seems like it could be an XY-problem as I have trouble imagining most organization/business/research questions being "how do I best combine ranks" per se. Is this narrow question really what you're trying to accomplish, which would be fine BTW, or are you trying to accomplish something else via this aggregation of ranks? – Galen Mar 04 '24 at 21:23
  • It depends, not enough context given: are "ratings" subjective (e.g. art, music, film, literature?) or objective? ("Is this a pink/blue/yellow/brown case for an iPhone 14?") Are you assuming all raters are equally reliable, trustworthy and consistent? and that interrater agreement is irrelevant? so that e.g. the ratings 3,3,5 are completely equivalent to 4,4,4? (see Cohen's Kappa) (How are the raters compensated/incentivized/punished? On crowdsourced task sites, do we exclude raters who went through spells of just repeatedly pressing the '3' key to get paid with minimal effort?) – smci Mar 04 '24 at 22:05
  • Also, what's the value of N (number of items)? Is it 1..5? or 1..100? Is "1" the best ranking or the worst? Are we only concerned with finding e.g. the "consensus #1/#2/#3" rankings, or with treating all N rankings uniformly? Harmonic mean is in general a bad idea since it will distort very low or very high rankings (esp. in the presence of outliers and interrater disagreement). Post us some sample data to add context to this question. – smci Mar 04 '24 at 22:12
  • Thanks for the comments. My motivation is mainly educational, I can imagine several applications such as sports, elections and market research. I've seen different scoring methods used and wondered if there are any tradeoffs involved, or if there is one 'canonical' method under certain mild assumptions. – akvilas Mar 06 '24 at 10:03
  • But you haven't answered those though: a) are "ratings" largely subjective or objective? a2) hence is interrater disagreement natural, or a sign of a problem between raters? b) what's the value of N (number of items)? Is it a small number like 1..5? or large 1..100? Presumably "1" is the best ranking not the worst (since you suggested harmonic mean)? c) How many things you pick to be rated matters: in say sports/elections, is it simply a binary "Which team/candidate will win this match/election(/league?)" or is it "Rate the importance of the following 12 political topics/players" – smci Mar 08 '24 at 23:06
  • As I said, my motivation is educational and I want to become aware of issues that may suggest the choice of one method over another. The difference between the methods, as far as I can see, is the spacing between final scores. In most real-world applications I've seen, the scoring difference between first and second is larger than between second and third. I don't know why, and I don't know if any of the properties you mention would prompt a change in this spacing. – akvilas Mar 10 '24 at 11:41

3 Answers

8

What you describe as the "simplest approach" (sum of all ranks) is effectively the Borda count: it is indeed simple to implement and easy to explain, so it is a good start (+1).

In general, the setup presented aligns well with the concept of ranked voting, a system where "voters indicate a rank to order candidates or options—in a sequence from first, second, third, and onwards—on their ballots" (from the Wikipedia article linked). We therefore want to explore two main research areas: 1. positional voting and 2. rank fusion methods. Both are niche but reasonably well-formulated research areas with clear overlap: positional voting is (expectedly) related to Political Science, and rank fusion to Information Retrieval.

As a first step, I would also suggest looking at the concept of Condorcet fusion, which tries to capture which choice is preferred over all other choices in pairwise comparisons. Following that, we can go crazy with techniques like CombSUM, CombMNZ, CombANZ, Rank-Biased Precision, Reciprocal Rank Fusion (RRF), and so on; see the RRF sketch below for one example.

Looking at a particular voting system allows us to explore which properties our "ranking combination method" will have: Is it Condorcet-consistent? Median-voter-consistent? Does it run in polynomial time? (Some methods are NP-hard.) Is it independent of irrelevant alternatives? Etc.
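
To make one of these concrete, here is a minimal sketch of Reciprocal Rank Fusion (toy data of my own; k = 60 is the constant commonly used in the IR literature):

```python
import numpy as np

def rrf(ranks, k=60):
    """Reciprocal Rank Fusion: `ranks` is an (items x judges) array of
    ranks (1 = best); returns one fused score per item, higher = better."""
    return (1.0 / (k + ranks)).sum(axis=1)

ranks = np.array([[1, 2, 1],
                  [2, 1, 3],
                  [3, 3, 2]])
print(np.argsort(-rrf(ranks)))  # consensus order, best item first
```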

To start with: given our initial Borda count method, we should see which properties are relevant for us and which are satisfied (and which are not). After that, we can move on to other voting/rank-combination methods that are guaranteed to satisfy our selection criteria.

usεr11852
5

I don't see any reason to use the harmonic or geometric mean to average ranks. The geometric mean is usually used for rates, and the harmonic mean for values that are on very different scales. There are also the median and the trimmed mean. I think a trimmed mean might be good if you want to exclude judges who are very different from the other judges.
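
A quick sketch of the median and trimmed-mean alternatives (toy data of my own; scipy's `trim_mean` assumed):

```python
import numpy as np
from scipy import stats

# ranks[i, j]: rank given to item i by judge j; five judges, the last
# of whom disagrees strongly with the rest (toy data)
ranks = np.array([[1, 1, 1, 2, 5],
                  [2, 3, 2, 1, 1],
                  [3, 2, 3, 3, 2]])

median_rank = np.median(ranks, axis=1)
# Drop the most extreme 20% of judges at each end before averaging:
trimmed_rank = stats.trim_mean(ranks, proportiontocut=0.2, axis=1)
print(median_rank, trimmed_rank)
```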

The underlying distribution of ranks (if you don't allow ties or skips) is going to be a discrete uniform from 1 to the number of items.

However, which alternative is "best" depends on what you want.

Peter Flom
3

Introduction

Let's consider a less direct approach, one that focuses on order itself rather than trying to average the ranks in some sense.

Many aggregation functions, although not all, discard information. Some, such as summation or scalar multiplication, are permutation invariant. See Just Plugging in Ranks for a couple of examples. There may be information about order and the preferences of individual judges that such functions throw away, so let's consider another approach.

I alluded to a similar approach to the one below in Beyond Studying Order With Ranks.

Starter Kit for a Bayesian Model

Suppose $D=(V, E(V))$ is a random directed graph where each edge is a random indicator variable

$$E_{ik}=\mathbb{1}\left[ (i,k) \in E(V) \right].$$

The rankings in the data encode an ordering: whenever $r_{ij} \leq r_{kj}$, we take that to be an observed edge from item $i$ to item $k$ for judge $j$.

Developing a Bayesian model, we can assign $E_{ik}$ to follow a Bernoulli likelihood:

$$E_{ik} \sim \text{Bernoulli}(p_{ik})$$

where

$$\text{logit} ( p_{ik} ) = \sum_{j=1}^M \beta_{ikj}\,\mathbb{1}(\text{Judge}=j) + \gamma_{ik}$$

and

$$\beta_{ikj} \sim \mathcal{N}(0,1)$$

$$\gamma_{ik} \sim \mathcal{N}(0,1)$$

You could potentially drop one of the judges if multi-collinearity is a problem.

From here you can go NUTS (i.e. use the No-U-Turn Sampler) to sample from the posterior, and follow the Bayesian workflow if you encounter challenges. Once you have posterior samples of this relation you'll be ready for the next step.
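
Here is a minimal PyMC sketch of the model as specified above (toy data of my own; note that with a single observation per $(i,k,j)$ cell the $\beta$'s are only weakly identified through their priors):

```python
import numpy as np
import pymc as pm

# Toy data: ranks[i, j] = rank of item i from judge j (1 = best)
rng = np.random.default_rng(0)
N, M = 4, 3
ranks = np.array([rng.permutation(N) + 1 for _ in range(M)]).T

# One observation per ordered pair (i, k), i != k, per judge:
# edge = 1 if judge j ranked item i at least as highly as item k.
pairs = [(i, k) for i in range(N) for k in range(N) if i != k]
pair_idx = np.repeat(np.arange(len(pairs)), M)
judge_idx = np.tile(np.arange(M), len(pairs))
edges = np.array([int(ranks[i, j] <= ranks[k, j])
                  for (i, k) in pairs for j in range(M)])

with pm.Model():
    beta = pm.Normal("beta", 0.0, 1.0, shape=(len(pairs), M))  # judge effects
    gamma = pm.Normal("gamma", 0.0, 1.0, shape=len(pairs))     # pair baselines
    logit_p = beta[pair_idx, judge_idx] + gamma[pair_idx]
    pm.Bernoulli("E", logit_p=logit_p, observed=edges)
    idata = pm.sample()  # NUTS by default
```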

Post-Sample Processing

Getting Back to Order

The next step is to deal with the fact that your sampled digraphs may not have edge sets that form a partial order. This may be due to judges genuinely disagreeing with each other, leading to non-transitivity or cycles. But we can post-process our samples to obtain something of a "consensus in order", which I will make clear below.

Let's go from directed graphs to directed acyclic graphs in a way that tries to preserve the ordering of the arrows. This can be done with graph condensation, which contracts each strongly connected component into a single node.

[Figure: a blue-node digraph sampled from our posterior, and in yellow the condensation of that sample.]

Next, via the transitive closure, we find the reachability relation, which is relevant because it preserves the order of the arcs. In effect we have taken a random relation and extracted the part of it that behaves like a random partial order.

These transformations have off-the-shelf implementations; networkx, for example, provides both condensation and transitive closure.
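
A minimal networkx sketch (toy digraph standing in for one posterior sample):

```python
import networkx as nx

# A sampled digraph over items, with a cycle 0 -> 1 -> 2 -> 0 (judges disagree)
D = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])

# Condense each strongly connected component into one node: yields a DAG.
C = nx.condensation(D)  # node attribute "members" maps back to the items

# Transitive closure of the DAG gives the reachability (partial-order) relation.
P = nx.transitive_closure_dag(C)

print(C.nodes(data="members"))  # e.g. [(0, {0, 1, 2}), (1, {3})]
print(list(P.edges))            # e.g. [(0, 1)]
```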

Statistics on Partial Orders

These sampled partial orders can be graded, which provides a new set of ranks for groupings of items. By considering the preimage of the graph processing we've done, you can map grades back to the original items, where items that shared a node in the condensation receive the same rank. These are, in some sense, "combined ranks" derived from the original ranks.
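
Continuing the networkx sketch above, here is one way to realize this mapping (taking grades as topological generations of the condensation; this construction is my own):

```python
import networkx as nx

D = nx.DiGraph([(0, 1), (1, 2), (2, 0), (2, 3)])  # as in the sketch above
C = nx.condensation(D)

# Grade g = index of the topological generation; items merged into the same
# condensed node share a grade, i.e. a combined rank.
grades = {}
for g, generation in enumerate(nx.topological_generations(C)):
    for node in generation:
        for item in C.nodes[node]["members"]:
            grades[item] = g
print(grades)  # {0: 0, 1: 0, 2: 0, 3: 1}
```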

On each graded partial order you can compute the grade entropy. Section 3.5 of Seilis 2022 describes computing the grade entropy, which is just the entropy of the grades, as a way of quantifying how close the partial order is to being a trivial order versus a total order. Doing this for each graded partial order gives a posterior distribution of grade entropies representing the uncertainty in the totality of the order. If the judges tend to agree closely on the ordering, this entropy will tend to be close to one; if there is complete disagreement, the items collapse into few grades and the entropy will be near zero. And there are of course many possibilities in between.
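
A sketch of one plausible implementation (I take the grades to be the topological generations and normalize by $\log n$ so the value lands in $[0, 1]$; see Seilis 2022, Section 3.5 for the precise definition):

```python
import numpy as np
import networkx as nx

def grade_entropy(P):
    """Normalized entropy of the grade-size distribution of a DAG:
    ~0 for a trivial order (one grade), ~1 for a total order."""
    sizes = np.array([len(g) for g in nx.topological_generations(P)], float)
    p = sizes / sizes.sum()
    H = -(p * np.log(p)).sum()
    n = sizes.sum()
    return float(H / np.log(n)) if n > 1 else 0.0
```

Applying `grade_entropy` to every posterior sample yields the distribution of grade entropies described above.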

Summary

  • By assuming that the ranks were reflective of some random ordering that depended on the judges, we can create a probabilistic model over random graphs.
  • The sampled graphs can be processed to extract partial orders reflecting what is ordered in the original graphs.
  • Various mathematical functions can tell you how strongly ordered the judges' preferences are.
Galen