
I am running a human evaluation on a recruiting website. The situation is as follows:

  • There are 78 annotators.
  • There are 100 questions.
  • Each annotator annotates 5 questions.
  • So, on average, each question is annotated by about 4 people. I say "on average" because the number of responses per question can vary.
  • Each question is rated on 4 dimensions (grammaticality, meaning, simplicity, all), each measured on a 0-to-5 Likert scale (-2 to 2 in the case of simplicity).

So, for example, looking at the grammaticality dimension, one has something like this:

       Q1    Q2    Q3   Q4   ...  Q100
A1    NaN     2   NaN  NaN   ...     3
A2    NaN   NaN     0  NaN   ...   NaN
...   ...   ...   ...  ...   ...   ...
A78     1     2     3    1   ...     5
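
For concreteness, here is a minimal sketch of how such a matrix could be built, assuming the raw responses are available in long form (one row per annotator–question pair). Pandas and the column names used here are my own illustration, not part of the actual setup:

    import pandas as pd

    # Hypothetical long-form export of the responses: one row per
    # (annotator, question) pair that was actually annotated.
    responses = pd.DataFrame({
        "annotator":      ["A1", "A2", "A78", "A78", "A78"],
        "question":       ["Q2", "Q3", "Q1",  "Q2",  "Q3"],
        "grammaticality": [2,    0,    1,     2,     3],
    })

    # Pivot into an annotator x question matrix; pairs that were never
    # assigned show up as NaN, as in the sketch above.
    matrix = responses.pivot(index="annotator",
                             columns="question",
                             values="grammaticality")
    print(matrix)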

Alternatively, if one does not keep track of which specific annotator gave each rating, the data can be represented per question like this:

 AA   AB   AC   AD   AE
  1    1    2    5    2
  2    2    2  NaN    2
  3    3    2    3    2
  0    4    5    5    2
...  ...  ...  ...  ...
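
A similar minimal sketch of how this annotator-agnostic view could be produced from the same kind of long-form data, again assuming pandas; the "slot" numbering of ratings within each question is purely illustrative:

    import pandas as pd

    # Hypothetical long-form responses, as before.
    responses = pd.DataFrame({
        "annotator":      ["A1", "A5", "A9", "A12", "A2", "A7", "A30"],
        "question":       ["Q1", "Q1", "Q1", "Q1",  "Q2", "Q2", "Q2"],
        "grammaticality": [1,    1,    2,    5,     2,    2,    2],
    })

    # Number the ratings within each question (1st, 2nd, ...) instead of
    # tracking who gave them, then pivot to a question x rating-slot table.
    responses["slot"] = responses.groupby("question").cumcount()
    anonymous = responses.pivot(index="question",
                                columns="slot",
                                values="grammaticality")
    print(anonymous)  # questions with fewer ratings get NaN in unused slots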

I am wondering whether there is an inter-annotator agreement measure that can be applied in this setting (sparse annotations with a varying number of raters per question), or whether there is some other way to estimate the agreement.

PS: Is there a book/paper you would recommend for studying IAA measures beyond the simple case where N "fixed" annotators annotate M examples with binary votes?
