The idea behind Cohen's $\kappa$ is to give context to the agreement rate by comparing it to the agreement rate expected for random annotators. This way, you do not mistake a seemingly high agreement rate for a genuinely high agreement rate when random annotators would achieve almost as much agreement (perhaps even more).
As in an answer of mine from yesterday, Cohen's $\kappa$ is defined by:
$$
\kappa = \dfrac{p_a - p_r}{1 - p_r}
$$
In this notation, $p_a$ is the actual agreement proportion. For this task, that would be the number of items on which all of the annotators agree, divided by the number of items annotated.
Then there is $p_r$, the random (chance) agreement proportion. This has been worked out for two annotators, with references given in the link. For three or more annotators, this might have a closed-form solution (that is my suspicion; a sketch of one candidate is below), but even if it does not, you can use a simulation to get quite close.
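For instance, if the random annotators are assumed to label independently, each according to their own marginal label distribution, one candidate closed form is to multiply the annotators' marginal proportions within each category and then sum across categories. The sketch below is my own generalisation of the familiar two-annotator formula; the function name chance_agreement is mine, not from any package, and I have not checked it against a reference for more than two annotators.
# Hedged sketch: closed-form p_r under an independence assumption, where each
# random annotator draws from that annotator's own marginal label distribution
chance_agreement <- function(label_list) {
  categories <- sort(unique(unlist(label_list)))
  # Matrix of marginal proportions: one row per category, one column per annotator
  props <- sapply(label_list, function(x) {
    table(factor(x, levels = categories)) / length(x)
  })
  # Probability that all annotators independently land on the same category
  sum(apply(props, 1, prod))
}
# Example call (with x1, ..., x4 defined as below):
# chance_agreement(list(x1, x2, x3, x4))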
The random annotators sample with replacement from each annotator's observed labels. You can do this hundreds or thousands of times in a loop, tracking how much agreement the random annotators have in each iteration of the loop. Then take the mean agreement over all of the iterations; that should be a good approximation of $p_r$, and the approximation improves as you run more iterations of the loop.
I give an example in R below.
set.seed(2023)
# Define the sample size
N <- 10000
# Define the number of loop iterations
R <- 9999
# Define some labels given by four annotators
x1 <- rbinom(N, 4, 0.2)
x2 <- rbinom(N, 4, 0.2)
x3 <- rbinom(N, 4, 0.2)
x4 <- rbinom(N, 4, 0.2)
# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N
# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){
  # Define random labels
  x1_random <- sample(x1, length(x1), replace = TRUE)
  x2_random <- sample(x2, length(x2), replace = TRUE)
  x3_random <- sample(x3, length(x3), replace = TRUE)
  x4_random <- sample(x4, length(x4), replace = TRUE)
  # Store the agreement between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}
# Calculate p_r as the mean of the random agreement values
p_r <- mean(random_agreement)/N
# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.005398978
That agreement score of 0.005398978 indicates that the annotation is only slightly better than random. Given that the labels were indeed generated at random, this is not surprising. Contrast this with a situation where the labeling is not random.
set.seed(2023)
# Define the sample size
N <- 10000
# Define the number of loop iterations
R <- 9999
# Define some labels given by four annotators
# Create one starting set of labels (x1) and then add 1 to the labels with
# varying probabilities to allow for disagreements
x1 <- rbinom(N, 4, 0.2)
x2 <- (x1 + rbinom(N, 1, 0.1)) %% 5 # 0.1 probability of disagreement from x1
x3 <- (x1 + rbinom(N, 1, 0.2)) %% 5 # 0.2 probability of disagreement from x1
x4 <- (x1 + rbinom(N, 1, 0.3)) %% 5 # 0.3 probability of disagreement from x1
# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N
# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){
  # Define random labels
  x1_random <- sample(x1, length(x1), replace = TRUE)
  x2_random <- sample(x2, length(x2), replace = TRUE)
  x3_random <- sample(x3, length(x3), replace = TRUE)
  x4_random <- sample(x4, length(x4), replace = TRUE)
  # Store the agreement between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}
# Calculate p_r as the mean of the random agreement values
p_r <- mean(random_agreement)/N
# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.483549
With the labels created to have some agreement, the agreement score is much higher, at 0.483549. If the code above is modified so that the annotators agree even more often, the score gets higher still (0.9341579 in my particular simulation).
set.seed(2023)
# Define the sample size
N <- 10000
# Define the number of loop iterations
R <- 9999
# Define some labels given by four annotators
# Create one starting set of labels (x1) and then add 1 to the labels with
# varying probabilities to allow for disagreements
x1 <- rbinom(N, 4, 0.2)
x2 <- (x1 + rbinom(N, 1, 0.01)) %% 5 # 0.01 probability of disagreement from x1
x3 <- (x1 + rbinom(N, 1, 0.02)) %% 5 # 0.02 probability of disagreement from x1
x4 <- (x1 + rbinom(N, 1, 0.03)) %% 5 # 0.03 probability of disagreement from x1
# Determine the agreement between the four annotators
p_a <- length(which(x1 == x2 & x2 == x3 & x3 == x4))/N
# Loop R-many times to determine agreement for random annotations
random_agreement <- rep(NA, R)
for (i in 1:R){
  # Define random labels
  x1_random <- sample(x1, length(x1), replace = TRUE)
  x2_random <- sample(x2, length(x2), replace = TRUE)
  x3_random <- sample(x3, length(x3), replace = TRUE)
  x4_random <- sample(x4, length(x4), replace = TRUE)
  # Store the agreement between the four random annotators
  random_agreement[i] <- length(which(
    x1_random == x2_random & x2_random == x3_random & x3_random == x4_random
  ))
}
# Calculate p_r as the mean of the random agreement values
p_r <- mean(random_agreement)/N
# Calculate the Cohen-style agreement statistic
(p_a - p_r)/(1 - p_r) # I get 0.9341579
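Since the same simulation appears three times above, here is a sketch of the same logic wrapped in a reusable function. The name kappa_multi and its arguments are my own choices for illustration, not something from an existing package.
# Sketch: the simulation above as a reusable function (names are my own)
kappa_multi <- function(labels, R = 9999) {
  # labels: a list of label vectors, one per annotator, all of the same length
  all_agree <- function(lst) {
    # TRUE wherever every annotator gave the same label as the first annotator
    Reduce(`&`, lapply(lst[-1], function(x) x == lst[[1]]))
  }
  p_a <- mean(all_agree(labels))
  random_agreement <- replicate(R, {
    # Each random annotator resamples with replacement from that annotator's own labels
    mean(all_agree(lapply(labels, sample, replace = TRUE)))
  })
  p_r <- mean(random_agreement)
  (p_a - p_r) / (1 - p_r)
}
# Example call with the four annotators defined above:
# kappa_multi(list(x1, x2, x3, x4))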
Following logic similar to what I use here, I would interpret this statistic as the reduction in the disagreement rate of your annotators compared to the expected disagreement rate of random annotators.
PROOF
$$
\kappa = \dfrac{p_a - p_r}{1 - p_r}
$$
Define $A$ as the total number of agreements in the annotations; $R$ as the expected number of agreements from random annotations; and $N$ as the total number of items annotated by each annotator. Then $p_a = A/N$ and $p_r = R/N$.
Following the logic given here, the reduction in the disagreement rate, compared to the expected disagreement rate of random annotators, is given by:
$$
1 - \dfrac{
\text{Disagreement rate of the true annotations}
}{
\text{Expected disagreement rate of random annotations}
} =
1-\left(
\dfrac{
1 - p_a
}{
1 - p_r
}\right)
$$
Next...
$$
1 - p_a = 1 - \dfrac{A}{N} = \dfrac{N - A}{N}\\
1 - p_r = 1 - \dfrac{R}{N} = \dfrac{N - R}{N}
$$
Thus...
$$
\dfrac{
1 - p_a
}{
1 - p_r}
= \dfrac{
\dfrac{N - A}{N}
}{
\dfrac{N - R}{N}
} = \dfrac{N - A}{N - R}
$$
Thus...
$$
\begin{aligned}
1 - \left(\dfrac{1 - p_a}{1 - p_r}\right)
&= 1 - \left(\dfrac{N - A}{N - R}\right) \\
&= \dfrac{N - R}{N - R} - \dfrac{N - A}{N - R} \\
&= \dfrac{N - R - (N - A)}{N - R} \\
&= \dfrac{N - R - N + A}{N - R} \\
&= \dfrac{A - R}{N - R} \\
&= \dfrac{\dfrac{A - R}{N}}{\dfrac{N - R}{N}} \\
&= \dfrac{\dfrac{A}{N} - \dfrac{R}{N}}{1 - \dfrac{R}{N}} \\
&= \dfrac{p_a - p_r}{1 - p_r} \qquad \square
\end{aligned}
$$
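As a quick numerical check of this algebra, here are two illustrative (made-up) values for $p_a$ and $p_r$; both forms give the same number.
# Numerical sanity check of the algebra with made-up illustrative values
p_a <- 0.80                # hypothetical observed agreement proportion
p_r <- 0.25                # hypothetical chance agreement proportion
(p_a - p_r) / (1 - p_r)    # Cohen-style form
1 - (1 - p_a) / (1 - p_r)  # reduction-in-disagreement form: same value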
As this has the same interpretation as the usual Cohen's $\kappa$, if you are comfortable with guidelines for "acceptable" or "good" agreement levels of Cohen's $\kappa$, those might be a starting point, with the caveat that it becomes harder and harder for random agreement to be high as the number of annotators increases (the short sketch below illustrates this).
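To illustrate that caveat, here is a small sketch assuming the annotators label independently and all share the Binomial(4, 0.2) marginal distribution used in the simulations above; the chance that all of them agree shrinks quickly as more annotators are added.
# Sketch: independence-based chance agreement for m annotators who all use the
# same marginal distribution over the five labels 0, ..., 4 (Binomial(4, 0.2),
# matching the simulations above); the values shrink quickly as m grows
marginal <- dbinom(0:4, 4, 0.2)
sapply(2:6, function(m) sum(marginal^m))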