
Is there an agreement method that would be well-suited for a data annotation task where:

  • the labels are discrete classes
  • each datapoint belongs to exactly one class (multi-class classification)
  • each datapoint is annotated by 3 or more annotators
  • the labels are very imbalanced, with some labels occurring significantly more frequently than others.

Cohen's kappa is only for two annotators, so won't work here. Fleiss' kappa (allegedly) assumes that each annotator needs to assign a certain number of cases to each category, which is not the case here. Randolph's kappa seems to assume a uniform distribution of the classes instead, which is also not the case, and isn't very widely adopted.

Can anyone recommend a suitable metric to use? Or maybe one of the ones above is still applicable? If the metric has established thresholds for "acceptable" or "good" agreement levels, that would be even better.

Many thanks!

Marek

1 Answer


The idea behind Cohen's $\kappa$ is to give context to the observed agreement rate by comparing it with the agreement rate expected from random annotators. This way, you do not mistake a seemingly high agreement rate for genuinely high agreement when random annotators would agree almost as often (perhaps even more often).
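As a made-up numeric illustration (these figures are mine, not from the question): if your annotators all agree on $90\%$ of the items, that sounds good, but if random annotators drawing from the same imbalanced label distribution would agree on $85\%$ of the items, the chance-corrected statistic defined below is only about $(0.90 - 0.85)/(1 - 0.85) \approx 0.33$, a much more sobering number.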

From an answer of mine yesterday, Cohen's $\kappa$ is defined by:

$$ \kappa = \dfrac{p_a - p_r}{1 - p_r} $$

In this notation, $p_a$ is the actual agreement proportion. For this task, that would be the number of items on which all of the annotators agree, divided by the number of items.
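As a tiny illustration (the toy labels and the use of a matrix are my own, not from the question), $p_a$ can be computed like this:

# Toy example of p_a (made-up labels): one row per item, one column per annotator
labels <- matrix(c("a", "a", "a",
                   "b", "b", "a",
                   "a", "a", "a",
                   "c", "c", "c",
                   "b", "a", "b"),
                 ncol = 3, byrow = TRUE)

# Proportion of items on which all annotators give the same label
p_a <- mean(apply(labels, 1, function(row) length(unique(row)) == 1))
p_a # 3 of the 5 items are unanimous, so 0.6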

Then there is $p_r$, the random agreement proportion. This has been worked out for two annotators, with references given in the link. For three or more annotators, this might have a closed-form solution (that is my suspicion), but even if it does not, you can use a simulation to get quite close.
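In fact, under the resampling scheme described next, a closed form does exist (a short derivation of my own, not taken from the linked references): if random annotator $j$ independently picks category $k$ with probability $\hat{p}_{jk}$, annotator $j$'s empirical proportion of category $k$, then the probability that all $J$ annotators pick the same category is

$$ p_r = \sum_{k} \prod_{j=1}^{J} \hat{p}_{jk}, $$

which reduces to the familiar two-annotator expression $\sum_k \hat{p}_{1k}\,\hat{p}_{2k}$ when $J = 2$. The simulation below estimates exactly this quantity.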

The random annotators sample with replacement from the true labels. You can do this hundreds or thousands of times in a loop, tracking how much agreement the random annotators have in each iteration. Then take the mean agreement over all of the iterations; that should be a good approximation of $p_r$, and it improves as you run more iterations.
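If you want an analytic cross-check of the simulation, the closed form above can be computed directly. This is a minimal sketch of my own (closed_form_p_r is not a function from any package), assuming the labels are supplied as a list of equal-length vectors, one per annotator:

# Minimal sketch (my own helper, not from a package): analytic chance agreement
# under the independence assumption above, to cross-check the simulation
closed_form_p_r <- function(label_list) {
  cats <- sort(unique(unlist(label_list)))
  # One row per category, one column per annotator: empirical label proportions
  props <- sapply(label_list, function(x) prop.table(table(factor(x, levels = cats))))
  # Probability that all annotators independently pick the same category
  sum(apply(props, 1, prod))
}

Called as closed_form_p_r(list(x1, x2, x3, x4)) on the simulated data below, it should land very close to the simulated p_r.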

I give an example in R below.

set.seed(2023)

# Define sample size
N <- 10000

# Define the number of loop iterations
R <- 9999

# Define some labels given by four annotators
x1 <- rbinom(N, 4, 0.2)
x2 <- rbinom(N, 4, 0.2)
x3 <- rbinom(N, 4, 0.2)
x4 <- rbinom(N, 4, 0.2)

# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N

# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){

  # Define random labels
  x1_random <- sample(x1, length(x1), replace = T)
  x2_random <- sample(x2, length(x2), replace = T)
  x3_random <- sample(x3, length(x3), replace = T)
  x4_random <- sample(x4, length(x4), replace = T)

  # Store the agreement count between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}

# Calculate p_r as the mean random agreement count, divided by N to get a proportion
p_r <- mean(random_agreement)/N

# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.005398978

That agreement score of 0.005398978 indicates that the annotation is only slightly better than random. Given that the labels were indeed generated at random, this is not surprising. Contrast this with a situation where the labeling is not random.

set.seed(2023)

# Define sample size
N <- 10000

# Define the number of loop iterations
R <- 9999

# Define some labels given by four annotators: create one starting set of labels (x1)
# and then add 1 to the labels with varying probabilities to allow for disagreements
x1 <- rbinom(N, 4, 0.2)
x2 <- (x1 + rbinom(N, 1, 0.1)) %% 5 # 0.1 probability of disagreement from x1
x3 <- (x1 + rbinom(N, 1, 0.2)) %% 5 # 0.2 probability of disagreement from x1
x4 <- (x1 + rbinom(N, 1, 0.3)) %% 5 # 0.3 probability of disagreement from x1

# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N

# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){

  # Define random labels
  x1_random <- sample(x1, length(x1), replace = T)
  x2_random <- sample(x2, length(x2), replace = T)
  x3_random <- sample(x3, length(x3), replace = T)
  x4_random <- sample(x4, length(x4), replace = T)

  # Store the agreement count between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}

# Calculate p_r as the mean random agreement count, divided by N to get a proportion
p_r <- mean(random_agreement)/N

# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.483549

With the labels created to have some agreement, the agreement score is much higher, at 0.483549. When the setup above is modified to have even more agreement, the score gets higher still (0.9341579 in my particular simulation).

set.seed(2023)

# Define sample size
N <- 10000

# Define the number of loop iterations
R <- 9999

# Define some labels given by four annotators: create one starting set of labels (x1)
# and then add 1 to the labels with varying probabilities to allow for disagreements
x1 <- rbinom(N, 4, 0.2)
x2 <- (x1 + rbinom(N, 1, 0.01)) %% 5 # 0.01 probability of disagreement from x1
x3 <- (x1 + rbinom(N, 1, 0.02)) %% 5 # 0.02 probability of disagreement from x1
x4 <- (x1 + rbinom(N, 1, 0.03)) %% 5 # 0.03 probability of disagreement from x1

# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N

# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){

  # Define random labels
  x1_random <- sample(x1, length(x1), replace = T)
  x2_random <- sample(x2, length(x2), replace = T)
  x3_random <- sample(x3, length(x3), replace = T)
  x4_random <- sample(x4, length(x4), replace = T)

  # Store the agreement count between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}

# Calculate p_r as the mean random agreement count, divided by N to get a proportion
p_r <- mean(random_agreement)/N

# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.9341579
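Rather than repeating this code for each data set, the whole procedure can be wrapped in a function. This is a minimal sketch of my own (the function name and interface are not from any package), taking a list of label vectors, one per annotator:

# Minimal sketch wrapping the procedure above; name and interface are my own
chance_corrected_agreement <- function(label_list, R = 9999) {

  # Proportion of items on which every annotator gives the same label
  all_agree <- function(lst) {
    agree <- Reduce(`&`, lapply(lst[-1], function(x) x == lst[[1]]))
    mean(agree)
  }
  p_a <- all_agree(label_list)

  # Simulated chance agreement: resample each annotator's labels with replacement
  random_agreement <- replicate(R, {
    shuffled <- lapply(label_list, function(x) sample(x, length(x), replace = TRUE))
    all_agree(shuffled)
  })
  p_r <- mean(random_agreement)

  (p_a - p_r) / (1 - p_r)
}

# Example call on the last simulated data set above
chance_corrected_agreement(list(x1, x2, x3, x4))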

Following logic similar to what I use here, I would interpret this statistic as the reduction in the disagreement rate of your annotators compared to the expected disagreement rate of random annotators.

PROOF

$$ \kappa = \dfrac{p_a - p_r}{1 - p_r} $$

Define $A$ as the number of items on which all annotators agree; $R$ as the expected number of such unanimous items under random annotation; and $N$ as the number of items annotated by each annotator. Then $p_a = A/N$ and $p_r = R/N$.

Following the logic given here, the reduction in the disagreement rate, compared to the expected disagreement rate of random annotators, is given by:

$$ 1 - \dfrac{ \text{Disagreement rate of the true annotations} }{ \text{Expected disagreement rate of random annotations} } = 1-\left( \dfrac{ 1 - p_a }{ 1 - p_r }\right) $$

Next...

$$ 1 - p_a = 1 - \dfrac{A}{N} = \dfrac{N - A}{N}\\ 1 - p_r = 1 - \dfrac{R}{N} = \dfrac{N - R}{N} $$

Thus...

$$ \dfrac{ 1 - p_a }{ 1 - p_r} = \dfrac{ \dfrac{N - A}{N} }{ \dfrac{N - R}{N} } = \dfrac{N - A}{N - R} $$

Thus...

$$ 1 - \left(\dfrac{ 1 - p_a }{ 1 - p_r}\right) \\= 1 - \left( \dfrac{N - A}{N - R} \right) \\= \left(\dfrac{N-R}{N-R}\right)-\left(\dfrac{N - A}{N-R}\right) \\= \dfrac{ N - R - (N - A)}{N - R} \\= \dfrac{N - R - N + A}{N - R} \\= \dfrac{A - R}{N - R} \\= \dfrac{ \dfrac{ A - R }{ N } }{ \dfrac{ N - R }{ N } } \\= \dfrac{ \dfrac{ A }{ N }-\dfrac{ R }{ N } }{ 1 - \dfrac{R}{N} } \\= \dfrac{p_a - p_r}{1 - p_r} \\ \square $$

As this has the same interpretation as the usual Cohen's $\kappa$, if you are comfortable with guidelines for "acceptable" or "good" agreement levels of Cohen's $\kappa$, those might be a start, with the caveat that it becomes harder and harder for random agreement to be high as the number of annotators increases.

Dave